:= could warn/error when RHS is ambiguous #3077

MichaelChirico · 2018-09-27T01:03:40Z

Was scraping the table here:

https://en.wikipedia.org/wiki/List_of_nominations_to_the_Supreme_Court_of_the_United_States

And didn't realize the output had duplicated column names:

SC = data.table(
  Nominee = "Name", 
  Nominee = "Age", 
  `Nominated by` = "President", 
  `Nominated by` = "Party", 
  Succession = "(Seat)   Justice", 
  Nomination = "Majorityparty", 
  Nomination = "Datesubmitted", 
  Nomination = "Outcome (vote)and date", 
  Nomination = "ND", 
)

SC[ , nominated := as.IDate(Nomination, '%b. %d, %Y')]

nominated will attempt to compute the RHS and will select the first Nomination column.

This is easily avoided by being careful about check.names up front, but if it's easy to check/warn on the fly, we should.

The text was updated successfully, but these errors were encountered:

franknarf1 · 2018-09-27T03:48:23Z

(Looking at the title...) Seems like a useful precaution in any j call, not just :=...?

MichaelChirico · 2018-09-27T03:52:04Z

Thinking again, maybe this can be solved up front by adding gateways in setDT, as.data.table.*, fread... if check.names is `FALSE, check the names anyway and warn by default?

Not sure it makes sense to run this code repeatedly in every j call

franknarf1 · 2018-09-27T04:37:10Z

Yeah, a warning upstream like you suggest would be more useful to me, though there are so many points at which data.tables are created, it may be too much of a headache that way, too. (Eg, besides the ones you mention, there's even rbindlist(list(list(x = 1, x = 2, x = 3))))

For me, := isn't special relative to other DT[...] contexts where I might make such a mistake: SC[, table(Nomination)]; SC[, .N, by=Nomination]. I don't feel strongly about it, though and this problem hardly affects me (since I work with predictable data structures these days).

MichaelChirico · 2018-09-27T04:52:19Z

right, since it's such an edge case and fread/setDT already probably cover 95% of data.table creation (fundamental flaw of input data really), I'd be fine letting users sort it out themselves in other cases... the behavior's unexpected only to the extent that you don't understand the table in the first place, which is relatively more likely at the fread/setDT junctures as well

…

On Thu, Sep 27, 2018, 12:37 PM Frank ***@***.***> wrote: Yeah, a warning upstream like you suggest would be more useful to me, though there are so many points at which data.tables are created, it may be too much of a headache that way, too. (Eg, besides the ones you mention, there's even rbindlist(list(list(x = 1, x = 2, x = 3)))) For me, := isn't special relative to other DT[...] contexts where I might make such a mistake: SC[, table(Nomination)]; SC[, .N, by=Nomination]. I don't feel strongly about it, though and this problem hardly affects me (since I work with predictable data structures these days). — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#3077 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHQQdWrwG0vkj4xE6DNSKA73j9KTbhRWks5ufFX6gaJpZM4W7u90> .

MichaelChirico · 2019-08-24T15:39:26Z

Have a fix for this issue like this:

-        allcols = c(names(x), xdotprefix, names(i), idotprefix)
-        ansvars = setdiff(intersect(av,allcols), bynames)
+        if (anyDuplicated(used_from_x <- dupintersect(names_x, av))) {
+          dupcols = unique(used_from_x[duplicated(used_from_x)])
+          stop("Unable to disambiguate reference to duplicated columns in x: ", brackify(dupcols))
+        }
+        allcols = c(names_x, xdotprefix, names_i, idotprefix)
+        ansvars = sdvars = setdiff(intersect(allcols, av), bynames)

But it breaks tests 1290 where there are some rules laid out for how data.table is supposed to act with duplicates, namely:

########################################
# Extensve testing for "duplicate" names
########################################
# Rules: Basically, if index is directly given in 'j', just those columns are touched/operated on. But if 'column' names are given and there are more than one
# occurrence of that column, then it's hard to decide which to keep and which to remove. So, to remove, all are removed, to keep, always the first is kept.
# 1) when i,j,by are all absent (or) just 'i' is present then ALL duplicate columns are returned.
# 2) When 'with=FALSE' and 'j' is a character and 'notj' is TRUE, all instances of the column to be removed will be removed.
# 3) When 'with=FALSE' and 'j' is a character and 'notj' is FALSE, only the first column will be recognised in presence of duplicate columns.
# 4) When 'with=FALSE' and 'j' is numeric and 'notj' is TRUE, just those indices will be removed.
# 5) When 'with=FALSE' and 'j' is numeric and 'notj' is FALSE, all columns for indices given, if valid, are returned. (FIXES #5688)
# 6) When .SD is in 'j', but '.SDcols'  is not present, ALL columns are subset'd - FIXES BUG #5008.
# 7) When .SD and .SDcols are present and .SDcols is numeric, columns corresponding to the given indices are returned.
# 8) When .SD and .SDcols are present and .SDcols is character, duplicate column names will only return the first column, each time.
# 9) When .SD and .SDcols are present and .SDcols is numeric, and it's -SDcols, then just those columns are removed.
# 10) When .SD and .SDcols are present and .SDcols is character and -SDcols, then all occurrences of that object is removed.
# 11) When no .SD and no .SDcols and no with=FALSE, only duplicate column names will return only the first column each time.
# 12) With 'get("col")', it's the same as with all character types.
# 13) A logical expression in 'j'.
# 14) Finally, no tests but.. using 'by' with duplicate columns and aggregating may not return the intended result, as it may operate on column names in some cases.

Not sure how married we are to the behavior laid out here (a comment in tests hardly seems like documentation)... thoughts @jangorecki / @mattdowle ?

jangorecki · 2019-08-24T16:07:12Z

As for me I would move towards not supporting duplicated column names wherever possible. The only use case I can think about is to produce table for reporting/formatting.

MichaelChirico · 2024-08-29T05:58:09Z

Hmm with hindsight I'd rather not pursue this. Duplicate names can be a pain but they are a fact of life -- I suspect it would cause more pain than help to impose this on user workflows.

E.g. I've recently dealt with this working with {dbplyr} -- I want to run a query like (distilled):

SELECT A.*, B.*
FROM A JOIN B USING (x,y,z)

But A and B might have columns besides x,y,z in common, causing this to fail because {dbplyr} (or maybe tibble/dplyr?) refuse to allow duplicates. The workaround is the much more painful approach of writing out all the column names.

Another example: I received DF and I want to convert to data.table to do some processing. If there are duplicate names in the object I receive and it won't affect me, I'll be annoyed I have to further process DF. If there are duplicates and it will affect me, this change amounts to shuffling the code to deal with duplicates around (from after to before setDT()/as.data.table()) -- seems like busywork.

MichaelChirico mentioned this issue Feb 17, 2019

TODO items from tests.Rraw #2572

Closed

51 tasks

MichaelChirico pushed a commit that referenced this issue Jul 23, 2020

Closes #3077 -- error for duplicate column names

a2f9de9

MichaelChirico mentioned this issue Jul 23, 2020

block duplicates in assignment #4634

Closed

MichaelChirico added the idate/itime label Sep 8, 2020

ColeMiller1 mentioned this issue Feb 9, 2021

Inconsistent behavior in keyed/unkeyed joins against duplicate columns #4891

Open

MichaelChirico closed this as completed Aug 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

:= could warn/error when RHS is ambiguous #3077

:= could warn/error when RHS is ambiguous #3077

MichaelChirico commented Sep 27, 2018

franknarf1 commented Sep 27, 2018

MichaelChirico commented Sep 27, 2018

franknarf1 commented Sep 27, 2018

MichaelChirico commented Sep 27, 2018 via email

MichaelChirico commented Aug 24, 2019 •

edited

Loading

jangorecki commented Aug 24, 2019 •

edited

Loading

MichaelChirico commented Aug 29, 2024

:= could warn/error when RHS is ambiguous #3077

:= could warn/error when RHS is ambiguous #3077

Comments

MichaelChirico commented Sep 27, 2018

franknarf1 commented Sep 27, 2018

MichaelChirico commented Sep 27, 2018

franknarf1 commented Sep 27, 2018

MichaelChirico commented Sep 27, 2018 via email

MichaelChirico commented Aug 24, 2019 • edited Loading

jangorecki commented Aug 24, 2019 • edited Loading

MichaelChirico commented Aug 29, 2024

MichaelChirico commented Aug 24, 2019 •

edited

Loading

jangorecki commented Aug 24, 2019 •

edited

Loading