Bug related to merging on character vs factor column when factor is sorted #5361

gbrunick · 2022-04-05T13:49:43Z

I see the following behavior which I believe indicates a bug in merge.data.table:

> library(data.table)
> packageDescription("data.table")$Version
[1] "1.14.2"
> some_letters <- rev(letters[1:3])
> some_more_letters <- rep(letters[1:3], 2)
> dt1 <- data.table(x = some_letters, y = 1:3)
> dt2 <- data.table(x = factor(some_more_letters, levels = some_letters), z = 1:6)
> dt2 <- setkey(dt2, x, z)
> dt3 <- merge(dt1, dt2, by = "x")
> str(dt3)
Classes ‘data.table’ and 'data.frame':	6 obs. of  3 variables:
 $ x: chr  "c" "c" "b" "b" ...
 $ y: int  1 1 2 2 3 3
 $ z: int  3 6 2 5 1 4
 - attr(*, "sorted")= chr "x"
 - attr(*, ".internal.selfref")=<externalptr> 
> dt3[x %in% "c", ]
Empty data.table (0 rows and 3 cols): x,y,z
> dt3[(x %in% "c"), ]
   x y z
1: c 1 3
2: c 1 6

I believe the problem is that dt3 thinks column x is sorted (and it would be if it was a factor), but it is not as a character. I assume that data.tables has an internal optimized %in% operator that uses this information and then gives the wrong result when we attempt to subset on x %in% "c". Finally, I assume that wrapping the subset operation in parenthesis avoids the use of data.tables internal %in%, so the subset works correctly as it no longer used the incorrect sorted attribute on dt3. Even if the last two assumptions are wrong, the behavior above seems incorrect.

The closest thing I could find is issue #499, but I think is is different.

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.2

loaded via a namespace (and not attached):
[1] compiler_3.6.3

The text was updated successfully, but these errors were encountered:

avimallu · 2022-04-05T17:17:29Z

The join seems accurate to me. What were you expecting? The result matches with R base too.

library(data.table)
some_letters <- rev(letters[1:3])
some_more_letters <- rep(letters[1:3], 2)

dt1 <- data.table(x = some_letters, y = 1:3)
dt2 <- data.table(x = factor(some_more_letters, levels = some_letters), z = 1:6)
dt2 <- setkey(dt2, x, z)

# Calls data.table merge
dt3 <- merge(dt1, dt2, by = "x")
# Calls base merge
df3 <- merge(as.data.frame(dt1), as.data.frame(dt2), by="x")

fsetdiff(dt3, as.data.table(df3))

gbrunick · 2022-04-05T17:42:00Z

I expected dt3[x %in% "c", ] not to be empty after the merge. In particular, I expect dt3[x %in% "c", ] and dt3[(x %in% "c"), ] to agree.

daniellga · 2022-04-06T14:04:45Z

It seems to be related to setkey. Once you remove the key, the behavior goes back to normal.

> library(data.table)

> dt3 <- structure(list(x = c("c", "c", "b", "b", "a", "a"), y = c(1L, 1L, 2L, 2L, 3L, 3L), z = c(3L, 6L, 2L, 5L, 1L, 4L)), sorted = "x", class = c("data.table", "data.frame"), row.names = c(NA, -6L))

> dt3[x %in% c("a", "b", "c"), ]
   x y z
1: b 2 2
2: b 2 5

> setkey(dt3, NULL)

> dt3[x %in% c("a", "b", "c"), ]
   x y z
1: c 1 3
2: c 1 6
3: b 2 2
4: b 2 5
5: a 3 1
6: a 3 4

ben-schwen · 2022-04-06T19:05:47Z

AFAIR the fast subsetting problem always arises when a data.table has a key attribute although it isn't really sorted according to that key.

At the example presented here, there arise multiple different issues and I'm not sure which is the best one to fix:

the merge of a character and a factor column returns a "wrong" result in the sense the that the result has a key although it it not sorted by the `key

dt1 = data.table(x=c("c", "b", "a"))
dt2 = data.table(x=factor(c("a", "b", "c"), levels=c("c", "b", "a")))
setkey(dt2, x)
dt = dt2[dt1, on="x"]

fast subset does not work on a keyed data.table which is actually not sorted by the key
setkeyv might have to check if it subsets the key, because currently sorted is not always trustworthy. This is related to keys are wrong/don't update if column names aren't unique #4888 and Inconsistent behavior in keyed/unkeyed joins against duplicate columns #4891

ben-schwen mentioned this issue Apr 6, 2022

Merge on factor and character returns wrongly keyed data.table #5362

Open

iMissile mentioned this issue May 23, 2022

Error related to merging on character vs factor column #5391

Closed

ben-schwen mentioned this issue Dec 21, 2022

I had expected rbindlist() to preserve attributes, but it doesn't. It'd be a nice feature; or an important limitation to disclose in the documentation. #5569

Open

ben-schwen mentioned this issue Jan 25, 2023

Using a function inside by causes troubles #5583

Open

ben-schwen mentioned this issue Sep 13, 2023

Setting "sorted" attribute in a data.table creates an erroneous merge #5687

Closed

ben-schwen mentioned this issue Feb 21, 2024

Merge returning wrong result when key is set. #5945

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug related to merging on character vs factor column when factor is sorted #5361

Bug related to merging on character vs factor column when factor is sorted #5361

gbrunick commented Apr 5, 2022

avimallu commented Apr 5, 2022

gbrunick commented Apr 5, 2022 •

edited

Loading

daniellga commented Apr 6, 2022 •

edited

Loading

ben-schwen commented Apr 6, 2022

Bug related to merging on character vs factor column when factor is sorted #5361

Bug related to merging on character vs factor column when factor is sorted #5361

Comments

gbrunick commented Apr 5, 2022

avimallu commented Apr 5, 2022

gbrunick commented Apr 5, 2022 • edited Loading

daniellga commented Apr 6, 2022 • edited Loading

ben-schwen commented Apr 6, 2022

gbrunick commented Apr 5, 2022 •

edited

Loading

daniellga commented Apr 6, 2022 •

edited

Loading