Setting "sorted" attribute in a data.table creates an erroneous merge #5687

dominiqueemmanuel · 2023-09-13T09:08:56Z

I don't know whether this is really a bug or intended behaviour, but to me it looks like a bug!
I have two data.tables, t1 and t2, like this:

library(data.table)
t1 <- structure(list(ID = c("99", "100"))
                , class = c("data.table",  "data.frame")
)
t1
# ID
# 1:  99
# 2: 100

t2 <- structure(list(ID = "100", x = 1)
                , class = c("data.table","data.frame")
)
t2
# ID x
# 1: 100 1

The merge is OK:

merge(t1,t2,by="ID",all = TRUE)
# ID  x
# 1: 100  1
# 2:  99 NA

But if I have the same data.tables with a sorted attribute (resulting from another merge, for instance), the tables look the same:


t1 <- structure(list(ID = c("99", "100"))
                , sorted = "ID"# <<-------- NEW
                , class = c("data.table",  "data.frame")
                )
t1
# ID
# 1:  99
# 2: 100

t2 <- structure(list(ID = "100", x = 1)
                , sorted = "ID"# <<-------- NEW
                , class = c("data.table","data.frame")
                )
t2
# ID x
# 1: 100 1

But the merge is now erroneous:

merge(t1,t2,by="ID",all = TRUE)
# ID  x
# 1: 100 NA  <===== THIS IS WRONG
# 2: 100  1
# 3:  99 NA

Can you confirm it's a bug?

Kind regards,
Dominique

packageVersion("data.table")
# [1] ‘1.14.8’


ben-schwen · 2023-09-13T12:16:59Z

data.table internally uses the "sorted" attribute for storing keys. Setting this attribute on your own results in a data.table that is not actually sorted but believes it is, hence the wrong result.

So similar to #5361

AFAIR the fast subsetting problem always arises when a data.table has a key attribute although it isn't really sorted according to that key.

Setting keys the intended (and only legal) way works fine:

library(data.table)
t1 <- data.table(ID=c("99", "100"), key="ID")
t2 <- data.table(ID=c("100"), x=1, key="ID")
merge(t1, t2, by="ID", all=TRUE)
#>     ID  x
#> 1: 100  1
#> 2:  99 NA

The only way to really avoid this would be to check whether the sorting is correct before doing binary searches for fast subsetting, merging, etc.
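Such a check can be sketched in base R. This is only an illustration: `rows_in_key_order` is a hypothetical helper, not a data.table function, and it assumes that "sorted by key" means ascending C-locale order with NA placed first.

```r
# Hypothetical helper (not part of data.table): TRUE when the rows of `x`
# are already in ascending order of the columns named in `cols`.
# method = "radix" makes order() compare strings in the C locale;
# na.last = FALSE puts NA first (assumed here to match key ordering).
rows_in_key_order <- function(x, cols) {
  args <- c(unname(as.list(x)[cols]), list(method = "radix", na.last = FALSE))
  identical(do.call(order, args), seq_len(nrow(x)))
}

# Plain data.frames are enough to show the idea.
claimed <- data.frame(ID = c("99", "100"), stringsAsFactors = FALSE)
actual  <- data.frame(ID = c("100", "99"), stringsAsFactors = FALSE)

rows_in_key_order(claimed, "ID")  # FALSE: the row order from the report
rows_in_key_order(actual, "ID")   # TRUE: the order a key on ID would need
```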

jangorecki · 2023-09-13T17:16:38Z

Note that you declare the tables as sorted by ID, but they aren't, at least as long as ID is a character column: "100" < "99" is TRUE, in case you missed that.
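A small base R illustration of the difference between character and numeric ordering, using the IDs from the report (the variable name `ids` is just for this sketch):

```r
ids <- c("99", "100")

# Character sort compares strings character by character, so "1" < "9"
# decides the result; method = "radix" uses C-locale ordering.
sort(ids, method = "radix")   # "100" "99"

# Sorting the same values as integers gives the numeric order instead.
sort(as.integer(ids))         # 99 100
```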

t1 <- structure(list(ID = c("100","99"))
                , sorted = "ID"# <<-------- NEW
                , class = c("data.table",  "data.frame")
                )
t2 <- structure(list(ID = "100", x = 1)
                , sorted = "ID"# <<-------- NEW
                , class = c("data.table","data.frame")
                )
merge(t1,t2,by="ID",all = TRUE)
#Key: <ID>
#       ID     x
#   <char> <num>
#1:    100     1
#2:     99    NA

So if you set the sorted attribute according to the real order (note that I fixed the order of "100" and "99"), merge works fine. Anyway, the only supported ways to set the sorted attribute are setkey, setkeyv, and [, keyby=].
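If a table has already ended up with a hand-set sorted attribute, one possible repair is to clear the key and set it again through setkey. The sketch below simulates the situation with setattr; clearing first avoids relying on whether setkey re-sorts a table that already claims to be keyed by the same column (that behaviour is not verified here).

```r
library(data.table)

# Simulate the problem: rows are not in key order, but the attribute says so.
t1 <- data.table(ID = c("99", "100"))
setattr(t1, "sorted", "ID")

# Repair: drop the untrusted key, then let setkey sort and mark the table.
setkey(t1, NULL)
setkey(t1, ID)

t2 <- data.table(ID = "100", x = 1, key = "ID")
merged <- merge(t1, t2, by = "ID", all = TRUE)
merged
```

After the repair, t1 is physically ordered as "100", "99", and the merge returns two rows, as in the keyed example earlier in the thread.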

jangorecki closed this as completed Sep 13, 2023
