Setting "sorted" attribute in a data.table creates an erroneous merge #5687

Closed
dominiqueemmanuel opened this issue Sep 13, 2023 · 2 comments

Comments

@dominiqueemmanuel

I don't know if it is really a bug or intended behaviour, but to me it's a bug!
I have two data.tables, t1 and t2, like this:

library(data.table)
t1 <- structure(list(ID = c("99", "100"))
                , class = c("data.table",  "data.frame")
)
t1
# ID
# 1:  99
# 2: 100

t2 <- structure(list(ID = "100", x = 1)
                , class = c("data.table","data.frame")
)
t2
# ID x
# 1: 100 1

The merge is OK:

merge(t1,t2,by="ID",all = TRUE)
# ID  x
# 1: 100  1
# 2:  99 NA

But if I have the same data.tables with a sorted attribute (resulting from another merge, for instance), the tables look the same:


t1 <- structure(list(ID = c("99", "100"))
                , sorted = "ID"# <<-------- NEW
                , class = c("data.table",  "data.frame")
                )
t1
# ID
# 1:  99
# 2: 100

t2 <- structure(list(ID = "100", x = 1)
                , sorted = "ID"# <<-------- NEW
                , class = c("data.table","data.frame")
                )
t2
# ID x
# 1: 100 1

But the merge is now erroneous:

merge(t1,t2,by="ID",all = TRUE)
# ID  x
# 1: 100 NA  <===== THIS IS WRONG
# 2: 100  1
# 3:  99 NA

Can you confirm it's a bug?

Kind regards !
Dominique

packageVersion("data.table")
# [1] ‘1.14.8’
@ben-schwen
Member

ben-schwen commented Sep 13, 2023

data.table internally uses the "sorted" attribute for storing keys. Setting this attribute on your own results in a data.table that is not actually sorted but believes it is, hence the wrong result.

So similar to #5361

AFAIR the fast subsetting problem always arises when a data.table has a key attribute although it isn't really sorted according to that key.

Setting keys the intended and only legal way works fine:

library(data.table)
t1 <- data.table(ID=c("99", "100"), key="ID")
t2 <- data.table(ID=c("100"), x=1, key="ID")
merge(t1, t2, by="ID", all=TRUE)
#>     ID  x
#> 1: 100  1
#> 2:  99 NA

The only way to really avoid this would be to check that the sorting is correct before doing binary searches for fast subsetting, merging, etc.
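A minimal sketch of what such a check could look like, using base R only. `key_is_valid` is a hypothetical helper, not a data.table function, and it assumes keys are ordered ascending in the C locale with NAs first; it is not how data.table validates keys internally.

```r
# Hypothetical helper, not part of data.table: TRUE when the columns named in
# the "sorted" attribute really are in ascending C-locale order.
key_is_valid <- function(x) {
  cols <- attr(x, "sorted", exact = TRUE)
  if (is.null(cols)) return(TRUE)  # no key claimed, nothing to verify
  # method = "radix" always compares strings in the C locale;
  # na.last = FALSE places NAs first (assumed to match keyed ordering).
  ord <- do.call(order, c(unname(unclass(x)[cols]),
                          list(na.last = FALSE, method = "radix")))
  # The data is sorted exactly when ordering it changes nothing.
  identical(ord, seq_along(ord))
}

# Tables built the same way as in the report above.
bad  <- structure(list(ID = c("99", "100")), sorted = "ID",
                  class = c("data.table", "data.frame"))
good <- structure(list(ID = c("100", "99")), sorted = "ID",
                  class = c("data.table", "data.frame"))

key_is_valid(bad)   # FALSE: "99" comes before "100", but "100" < "99"
key_is_valid(good)  # TRUE
```

Running a check like this on every keyed operation costs a pass over the key columns, which is presumably why it is not done by default.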

@jangorecki
Member

jangorecki commented Sep 13, 2023

Note that you specify that the tables are sorted by ID, but they aren't, at least as long as you use character columns: "100" < "99" is TRUE, in case you missed that.
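The comparison can be checked directly in base R. This is only an illustration of character versus numeric ordering; the variable names are made up for the example.

```r
ids <- c("99", "100")

# Strings are compared character by character, so "1" sorts before "9".
chr_less <- "100" < "99"                        # TRUE
chr_sorted <- sort(ids, method = "radix")       # "100" "99" (C-locale order)

# Converted to numbers, the order is the one the report expected.
num_sorted <- sort(as.integer(ids))             # 99 100
```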

t1 <- structure(list(ID = c("100","99"))
                , sorted = "ID"# <<-------- NEW
                , class = c("data.table",  "data.frame")
                )
t2 <- structure(list(ID = "100", x = 1)
                , sorted = "ID"# <<-------- NEW
                , class = c("data.table","data.frame")
                )
merge(t1,t2,by="ID",all = TRUE)
#Key: <ID>
#       ID     x
#   <char> <num>
#1:    100     1
#2:     99    NA

So if you set the sorted attribute according to the real order (note that I fixed the order of "100" and "99"), merge works fine. Anyway, the only supported ways to set the sorted attribute are setkey, setkeyv, and [, keyby=].
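For reference, a short sketch of the three supported routes applied to the tables from this issue. The `agg` aggregation is only there to show `keyby` and is not part of the original report.

```r
library(data.table)

# setkey() physically reorders the rows and then records the key.
t1 <- data.table(ID = c("99", "100"))
setkey(t1, ID)
key(t1)    # "ID"
t1$ID      # "100" "99": rows were reordered to match the key

# setkeyv() does the same with column names given as a character vector.
t2 <- data.table(ID = "100", x = 1)
setkeyv(t2, "ID")

# keyby= groups and returns a result keyed by the grouping columns.
agg <- t2[, .(x = sum(x)), keyby = ID]
key(agg)   # "ID"

# With genuine keys, the merge gives one row per ID.
res <- merge(t1, t2, by = "ID", all = TRUE)
res
```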
