Bug related to merging on character vs factor column when factor is sorted #5361
Comments
The join seems accurate to me. What were you expecting? The result matches base R:
library(data.table)
some_letters <- rev(letters[1:3])
some_more_letters <- rep(letters[1:3], 2)
dt1 <- data.table(x = some_letters, y = 1:3)
dt2 <- data.table(x = factor(some_more_letters, levels = some_letters), z = 1:6)
dt2 <- setkey(dt2, x, z)
# Calls data.table merge
dt3 <- merge(dt1, dt2, by = "x")
# Calls base merge
df3 <- merge(as.data.frame(dt1), as.data.frame(dt2), by="x")
fsetdiff(dt3, as.data.table(df3))
I expected
It seems to be related to setkey. Once you remove the key, the behavior goes back to normal.
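To make the "remove the key" observation concrete, here is a minimal sketch that rebuilds the tables from the example above, inspects the key that the merge result carries, and then drops it with `setkey(dt3, NULL)`. What `key(dt3)` reports right after the merge may differ between data.table versions, so no output is claimed for that line.

```r
library(data.table)

# Same construction as in the example above.
some_letters <- rev(letters[1:3])
dt1 <- data.table(x = some_letters, y = 1:3)
dt2 <- data.table(x = factor(rep(letters[1:3], 2), levels = some_letters), z = 1:6)
setkey(dt2, x, z)

dt3 <- merge(dt1, dt2, by = "x")
key(dt3)            # inspect which key, if any, the merge result claims

# Drop the key: rows stay where they are, only the key attribute is removed.
setkey(dt3, NULL)
key(dt3)            # NULL
dt3[x %in% "c"]     # subset again, now without a key to rely on
```

Dropping the key only works around the symptom; it does not explain why the key was set in the first place.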
AFAIR, the fast-subsetting problem always arises when a [...]. In the example presented here, several different issues arise, and I'm not sure which one is best to fix:
dt1 = data.table(x=c("c", "b", "a"))
dt2 = data.table(x=factor(c("a", "b", "c"), levels=c("c", "b", "a")))
setkey(dt2, x)
dt = dt2[dt1, on="x"]
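One possible workaround, sketched here as an assumption rather than as the maintainers' recommendation, is to convert the factor column to character before setting the key, so that both sides of the join have the same type and the key order is built on the character values. The name `dt2_chr` is hypothetical.

```r
library(data.table)

dt1 <- data.table(x = c("c", "b", "a"))
dt2 <- data.table(x = factor(c("a", "b", "c"), levels = c("c", "b", "a")))

# Assumption: converting the factor to character up front is acceptable here.
dt2_chr <- copy(dt2)[, x := as.character(x)]
setkey(dt2_chr, x)        # the key is now built on the character values

res <- dt2_chr[dt1, on = "x"]
res$x                     # rows follow the order of dt1
```

This sidesteps the type mismatch instead of fixing it, and it loses the custom level order of the factor.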
I see the following behavior, which I believe indicates a bug in merge.data.table:
I believe the problem is that dt3 thinks column x is sorted (and it would be if it were a factor), but it is not sorted as a character column. I assume that data.table has an internal optimized %in% operator that uses this information and then gives the wrong result when we attempt to subset on x %in% "c". Finally, I assume that wrapping the subset operation in parentheses avoids the use of data.table's internal %in%, so the subset works correctly because it no longer uses the incorrect sorted attribute on dt3. Even if the last two assumptions are wrong, the behavior above seems incorrect.
The closest thing I could find is issue #499, but I think it is different.
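The suspicion that a table can claim to be sorted when it is not can be checked directly. Below is a minimal sketch of such a check. `key_is_consistent` is a hypothetical helper, not part of data.table, and the `bad` table simulates a stale key by forcing the `sorted` attribute with `setattr()` instead of reproducing the merge.

```r
library(data.table)

# Hypothetical helper (not part of data.table): TRUE when the rows really are
# in the order that the key claims.
key_is_consistent <- function(dt) {
  k <- key(dt)
  if (is.null(k)) return(TRUE)                      # no key, nothing to contradict
  resorted <- setkeyv(setkey(copy(dt), NULL), k)    # re-sort a copy from scratch
  all(vapply(names(dt),
             function(col) identical(dt[[col]], resorted[[col]]),
             logical(1)))
}

good <- data.table(x = c("c", "b", "a"), y = 1:3)
setkey(good, x)
key_is_consistent(good)   # TRUE

# Simulate a stale key by forcing the attribute onto unsorted data.
bad <- data.table(x = c("c", "b", "a"), y = 1:3)
setattr(bad, "sorted", "x")
key_is_consistent(bad)    # FALSE
```

Running the same helper on the merged table from the report would show whether its key matches the actual row order in a given data.table version.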