Inconsistent behavior in keyed/unkeyed joins against duplicate columns #4891

MichaelChirico · 2021-02-04T08:25:24Z

Observed while answering this SO Question: https://stackoverflow.com/a/66041678/3576984

# NB: recall CJ output is keyed by default
DT1 = CJ(a = 1:3, b = 4:5, c = 6)
# the same table, but column two is re-named a
DT2 = CJ(a = 1:3, a = 4:5, d = 6)

Observe the difference of when DT2 is keyed vs not:

# KEYED
DT1[DT2]
#    a b c i.a
# 1: 1 1 6   4
# 2: 1 1 6   5
# 3: 2 2 6   4
# 4: 2 2 6   5
# 5: 3 3 6   4
# 6: 3 3 6   5

# UNKEYED
setkey(DT2, NULL)
DT1[DT2]
#    a b c
# 1: 1 4 6
# 2: 1 5 6
# 3: 2 4 6
# 4: 2 5 6
# 5: 3 4 6
# 6: 3 5 6

Is there some reason the first case should be intended behavior?

The verbose output suggests it starts doing the right thing, then gets tripped up later on:

DT1[DT2, verbose=TRUE]
# i.a has same type (integer) as x.a. No coercion needed.
# i.a has same type (integer) as x.b. No coercion needed.
# i.d has same type (double) as x.c. No coercion needed.
# on= matches existing key, using key
# Starting bmerge ...
# bmerge done in 0.000s elapsed (0.000s cpu) 
# Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)

The text was updated successfully, but these errors were encountered:

MichaelChirico · 2021-02-04T08:27:10Z

In fact, it looks more like a bug when we make the second column have a non-matching type:

DT1 = CJ(a = 1:3, b = c('a', 'b'), c = 6)
DT2 = CJ(a = 1:3, a = c('a', 'b'), d = 6, sorted=FALSE)

# works OK
DT1[DT2]

# keyed version fails
setkey(DT2)
DT1[DT2, verbose=TRUE]
# i.a has same type (integer) as x.a. No coercion needed.
#Error in bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult,  : 
#   Incompatible join types: x.b (character) and i.a (integer)

magerton · 2021-02-05T22:06:29Z

I think this might be the same as #4888?

ColeMiller1 · 2021-02-09T01:50:53Z

What is expected behavior with duplicated column names and joins? I would expect an error indicating there are duplicated names in the key.

Source of the issue:

data.table/R/data.table.R

Lines 438 to 445 in 97c96b2

    
           ## missing on 
        
           rightcols = chmatch(key(x), names_x)   # NAs here (i.e. invalid data.table) checked in bmerge() 
        
           leftcols = if (haskey(i)) 
        
             chmatch(head(key(i), length(rightcols)), names(i)) 
        
           else 
        
             seq_len(min(length(i),length(rightcols))) 
        
           rightcols = head(rightcols,length(leftcols)) 
        
           ops = rep(1L, length(leftcols))

Proposed solution:

if (anyDuplicated(key(x)))
  stop("There are duplicated names in the key of X of the X[Y] join. To fix, rename the names with setnames()")
else if (haskey(i) && anyDuplicated(key(i)))
  stop("There are duplicated names in the key of Y of the X[Y] join. To fix, rename the names with setnames().")

A more global proposal would be to do more erroring when data.tables are made with duplicate names. See also #3077.

MichaelChirico · 2021-02-10T03:43:01Z

I think an error is the right way to go, probably we should error in setkey (and setindex?) as well. Easier than coming up with some ultimately-arbitrary way of dealing with merges on duplicate keys, as I see it.

The fix looks good, my nit is that I would include which columns are duplicated in the error message for user-friendliness.

ben-schwen · 2021-02-17T23:09:31Z

If we also include setkey, shouldn't we also consider to add a duplicate check to setnames too?

MichaelChirico · 2021-02-18T07:28:26Z

I don't see as much of a problem with duplicate names in the table in general as I do for the keys. For example duplicate column names can be essential for formatting output tables; forcing users to create workarounds for this common case seems like an unnecessary burden to me.

If there's really a compelling case to try and block duplicate column names, perhaps we could expose that through an option (e.g. datatable.strict.colnames) that errors when duplicate names are tried anywhere.

ben-schwen · 2021-02-18T08:32:17Z

The problem with setnames is that you can use it to alter the names of the keys.

For example

dt = data.table(a = 1:2, b = 3:4)
setkey(dt, "a", "b")
setnames(dt, "b", "a")
key(dt)
[1] "a" "a"

leads you back to the problem of duplicated keys although not directly setting them in the first place.

MichaelChirico · 2021-02-18T08:41:33Z

I see... I was thinking of the general case of duplicate names in setnames. Actually setnames does some specific checks about key(x), we could check anyDuplicated there without interfering with general use cases of setnames.

ayushnoori · 2021-05-25T06:53:42Z

Following up on this thread, recently noticed an issue where, after running setkey on a specific column x and then deleting that column x (i.e., df[, x := NULL]), basic data.table operations yield unexpected results (e.g., df[y == "cat"] does not work for my use case). Is this known/expected behavior?

Might be the same as #4088 hence mentioning here. This may very well be user error - have not had time to generate a MWE. Thanks for your help.

MichaelChirico added the consistency label Oct 2, 2021

This was referenced Dec 28, 2021

merge diagnostics argument/option #5298

Closed

output from merge #4677

Open

ben-schwen mentioned this issue Apr 6, 2022

Bug related to merging on character vs factor column when factor is sorted #5361

Open

ben-schwen mentioned this issue Feb 21, 2024

Merge returning wrong result when key is set. #5945

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent behavior in keyed/unkeyed joins against duplicate columns #4891

Inconsistent behavior in keyed/unkeyed joins against duplicate columns #4891

MichaelChirico commented Feb 4, 2021

MichaelChirico commented Feb 4, 2021

magerton commented Feb 5, 2021

ColeMiller1 commented Feb 9, 2021

MichaelChirico commented Feb 10, 2021

ben-schwen commented Feb 17, 2021

MichaelChirico commented Feb 18, 2021

ben-schwen commented Feb 18, 2021

MichaelChirico commented Feb 18, 2021

ayushnoori commented May 25, 2021

Inconsistent behavior in keyed/unkeyed joins against duplicate columns #4891

Inconsistent behavior in keyed/unkeyed joins against duplicate columns #4891

Comments

MichaelChirico commented Feb 4, 2021

MichaelChirico commented Feb 4, 2021

magerton commented Feb 5, 2021

ColeMiller1 commented Feb 9, 2021

MichaelChirico commented Feb 10, 2021

ben-schwen commented Feb 17, 2021

MichaelChirico commented Feb 18, 2021

ben-schwen commented Feb 18, 2021

MichaelChirico commented Feb 18, 2021

ayushnoori commented May 25, 2021