Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent behavior in keyed/unkeyed joins against duplicate columns #4891

Open
MichaelChirico opened this issue Feb 4, 2021 · 9 comments

Comments

@MichaelChirico
Copy link
Member

Observed while answering this SO Question: https://stackoverflow.com/a/66041678/3576984

# NB: recall CJ output is keyed by default
DT1 = CJ(a = 1:3, b = 4:5, c = 6)
# the same table, but column two is re-named a
DT2 = CJ(a = 1:3, a = 4:5, d = 6)

Observe the difference of when DT2 is keyed vs not:

# KEYED
DT1[DT2]
#    a b c i.a
# 1: 1 1 6   4
# 2: 1 1 6   5
# 3: 2 2 6   4
# 4: 2 2 6   5
# 5: 3 3 6   4
# 6: 3 3 6   5

# UNKEYED
setkey(DT2, NULL)
DT1[DT2]
#    a b c
# 1: 1 4 6
# 2: 1 5 6
# 3: 2 4 6
# 4: 2 5 6
# 5: 3 4 6
# 6: 3 5 6

Is there some reason the first case should be intended behavior?

The verbose output suggests it starts doing the right thing, then gets tripped up later on:

DT1[DT2, verbose=TRUE]
# i.a has same type (integer) as x.a. No coercion needed.
# i.a has same type (integer) as x.b. No coercion needed.
# i.d has same type (double) as x.c. No coercion needed.
# on= matches existing key, using key
# Starting bmerge ...
# bmerge done in 0.000s elapsed (0.000s cpu) 
# Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
@MichaelChirico
Copy link
Member Author

In fact, it looks more like a bug when we make the second column have a non-matching type:

DT1 = CJ(a = 1:3, b = c('a', 'b'), c = 6)
DT2 = CJ(a = 1:3, a = c('a', 'b'), d = 6, sorted=FALSE)

# works OK
DT1[DT2]

# keyed version fails
setkey(DT2)
DT1[DT2, verbose=TRUE]
# i.a has same type (integer) as x.a. No coercion needed.
#Error in bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult,  : 
#   Incompatible join types: x.b (character) and i.a (integer)

@magerton
Copy link

magerton commented Feb 5, 2021

I think this might be the same as #4888?

@ColeMiller1
Copy link
Contributor

What is expected behavior with duplicated column names and joins? I would expect an error indicating there are duplicated names in the key.

Source of the issue:

data.table/R/data.table.R

Lines 438 to 445 in 97c96b2

## missing on
rightcols = chmatch(key(x), names_x) # NAs here (i.e. invalid data.table) checked in bmerge()
leftcols = if (haskey(i))
chmatch(head(key(i), length(rightcols)), names(i))
else
seq_len(min(length(i),length(rightcols)))
rightcols = head(rightcols,length(leftcols))
ops = rep(1L, length(leftcols))

Proposed solution:

if (anyDuplicated(key(x)))
  stop("There are duplicated names in the key of X of the X[Y] join. To fix, rename the names with setnames()")
else if (haskey(i) && anyDuplicated(key(i)))
  stop("There are duplicated names in the key of Y of the X[Y] join. To fix, rename the names with setnames().")

A more global proposal would be to do more erroring when data.tables are made with duplicate names. See also #3077.

@MichaelChirico
Copy link
Member Author

I think an error is the right way to go, probably we should error in setkey (and setindex?) as well. Easier than coming up with some ultimately-arbitrary way of dealing with merges on duplicate keys, as I see it.

The fix looks good, my nit is that I would include which columns are duplicated in the error message for user-friendliness.

@ben-schwen
Copy link
Member

If we also include setkey, shouldn't we also consider to add a duplicate check to setnames too?

@MichaelChirico
Copy link
Member Author

I don't see as much of a problem with duplicate names in the table in general as I do for the keys. For example duplicate column names can be essential for formatting output tables; forcing users to create workarounds for this common case seems like an unnecessary burden to me.

If there's really a compelling case to try and block duplicate column names, perhaps we could expose that through an option (e.g. datatable.strict.colnames) that errors when duplicate names are tried anywhere.

@ben-schwen
Copy link
Member

The problem with setnames is that you can use it to alter the names of the keys.

For example

dt = data.table(a = 1:2, b = 3:4)
setkey(dt, "a", "b")
setnames(dt, "b", "a")
key(dt)
[1] "a" "a"

leads you back to the problem of duplicated keys although not directly setting them in the first place.

@MichaelChirico
Copy link
Member Author

I see... I was thinking of the general case of duplicate names in setnames. Actually setnames does some specific checks about key(x), we could check anyDuplicated there without interfering with general use cases of setnames.

@ayushnoori
Copy link

Following up on this thread, recently noticed an issue where, after running setkey on a specific column x and then deleting that column x (i.e., df[, x := NULL]), basic data.table operations yield unexpected results (e.g., df[y == "cat"] does not work for my use case). Is this known/expected behavior?

Might be the same as #4088 hence mentioning here. This may very well be user error - have not had time to generate a MWE. Thanks for your help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants