Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug related to merging on character vs factor column when factor is sorted #5361

Open
gbrunick opened this issue Apr 5, 2022 · 4 comments
Open

Comments

@gbrunick
Copy link

gbrunick commented Apr 5, 2022

I see the following behavior which I believe indicates a bug in merge.data.table:

> library(data.table)
> packageDescription("data.table")$Version
[1] "1.14.2"
> some_letters <- rev(letters[1:3])
> some_more_letters <- rep(letters[1:3], 2)
> dt1 <- data.table(x = some_letters, y = 1:3)
> dt2 <- data.table(x = factor(some_more_letters, levels = some_letters), z = 1:6)
> dt2 <- setkey(dt2, x, z)
> dt3 <- merge(dt1, dt2, by = "x")
> str(dt3)
Classes ‘data.table’ and 'data.frame':	6 obs. of  3 variables:
 $ x: chr  "c" "c" "b" "b" ...
 $ y: int  1 1 2 2 3 3
 $ z: int  3 6 2 5 1 4
 - attr(*, "sorted")= chr "x"
 - attr(*, ".internal.selfref")=<externalptr> 
> dt3[x %in% "c", ]
Empty data.table (0 rows and 3 cols): x,y,z
> dt3[(x %in% "c"), ]
   x y z
1: c 1 3
2: c 1 6

I believe the problem is that dt3 thinks column x is sorted (and it would be if it was a factor), but it is not as a character. I assume that data.tables has an internal optimized %in% operator that uses this information and then gives the wrong result when we attempt to subset on x %in% "c". Finally, I assume that wrapping the subset operation in parenthesis avoids the use of data.tables internal %in%, so the subset works correctly as it no longer used the incorrect sorted attribute on dt3. Even if the last two assumptions are wrong, the behavior above seems incorrect.

The closest thing I could find is issue #499, but I think is is different.

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.2

loaded via a namespace (and not attached):
[1] compiler_3.6.3
@avimallu
Copy link
Contributor

avimallu commented Apr 5, 2022

The join seems accurate to me. What were you expecting? The result matches with R base too.

library(data.table)
some_letters <- rev(letters[1:3])
some_more_letters <- rep(letters[1:3], 2)

dt1 <- data.table(x = some_letters, y = 1:3)
dt2 <- data.table(x = factor(some_more_letters, levels = some_letters), z = 1:6)
dt2 <- setkey(dt2, x, z)

# Calls data.table merge
dt3 <- merge(dt1, dt2, by = "x")
# Calls base merge
df3 <- merge(as.data.frame(dt1), as.data.frame(dt2), by="x")

fsetdiff(dt3, as.data.table(df3))

@gbrunick
Copy link
Author

gbrunick commented Apr 5, 2022

I expected dt3[x %in% "c", ] not to be empty after the merge. In particular, I expect dt3[x %in% "c", ] and dt3[(x %in% "c"), ] to agree.

@daniellga
Copy link

daniellga commented Apr 6, 2022

It seems to be related to setkey. Once you remove the key, the behavior goes back to normal.

> library(data.table)

> dt3 <- structure(list(x = c("c", "c", "b", "b", "a", "a"), y = c(1L, 1L, 2L, 2L, 3L, 3L), z = c(3L, 6L, 2L, 5L, 1L, 4L)), sorted = "x", class = c("data.table", "data.frame"), row.names = c(NA, -6L))

> dt3[x %in% c("a", "b", "c"), ]
   x y z
1: b 2 2
2: b 2 5

> setkey(dt3, NULL)

> dt3[x %in% c("a", "b", "c"), ]
   x y z
1: c 1 3
2: c 1 6
3: b 2 2
4: b 2 5
5: a 3 1
6: a 3 4

@ben-schwen
Copy link
Member

AFAIR the fast subsetting problem always arises when a data.table has a key attribute although it isn't really sorted according to that key.

At the example presented here, there arise multiple different issues and I'm not sure which is the best one to fix:

  1. the merge of a character and a factor column returns a "wrong" result in the sense the that the result has a key although it it not sorted by the `key
dt1 = data.table(x=c("c", "b", "a"))
dt2 = data.table(x=factor(c("a", "b", "c"), levels=c("c", "b", "a")))
setkey(dt2, x)
dt = dt2[dt1, on="x"]
  1. fast subset does not work on a keyed data.table which is actually not sorted by the key
  2. setkeyv might have to check if it subsets the key, because currently sorted is not always trustworthy. This is related to keys are wrong/don't update if column names aren't unique #4888 and Inconsistent behavior in keyed/unkeyed joins against duplicate columns #4891

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants