Merge returning wrong result when key is set. #5945

sindribaldur · 2024-02-21T10:15:17Z

I suspect this has something to do with encoding, but I failed to reproduce it with a more minimal example. identical(A$attr, B$attr) is TRUE, but data.table treats the values as different when the key is set.

A = structure(list(attr = c("Alveg ósammála", "Ósammála", "Hvorki né", 
"Sammála", "Alveg sammála"), Fjoldi = c(0L, 0L, 1L, 34L, 52L
), Fjoldiv = c(0, 0, 0.89469781238413, 34.4705228031146, 56.0988977651936
), Hlutfall = c(0, 0, 0.00978195415015335, 0.376874816194491, 
0.613343229655356)), row.names = c(NA, -5L), class = c("data.table", 
"data.frame"), sorted = "attr")
B = structure(list(attr = c("Alveg ósammála", "Ósammála", "Hvorki né", 
"Sammála", "Alveg sammála"), Fjoldi = c(1L, 0L, 5L, 26L, 13L
), Fjoldiv = c(0.868279569892473, 0, 4.52867383512545, 23.0122461170848, 
11.6621863799283), Hlutfall = c(0.021668318934995, 0, 0.113015153660955, 
0.574281263277156, 0.291035264126894)), row.names = c(NA, -5L
), class = c("data.table", "data.frame"))


merge(A, B, by = "attr", all = TRUE)
# Key: <attr>
#              attr Fjoldi.x  Fjoldiv.x  Hlutfall.x Fjoldi.y  Fjoldiv.y Hlutfall.y
#            <char>    <int>      <num>       <num>    <int>      <num>      <num>
# 1:  Alveg sammála       52 56.0988978 0.613343230       NA         NA         NA
# 2:  Alveg sammála       NA         NA          NA       13 11.6621864 0.29103526
# 3: Alveg ósammála        0  0.0000000 0.000000000        1  0.8682796 0.02166832
# 4:      Hvorki né        1  0.8946978 0.009781954        5  4.5286738 0.11301515
# 5:        Sammála       34 34.4705228 0.376874816       26 23.0122461 0.57428126
# 6:       Ósammála        0  0.0000000 0.000000000       NA         NA         NA
# 7:       Ósammála       NA         NA          NA        0  0.0000000 0.00000000
setkey(A, NULL)
merge(A, B, by = "attr", all = TRUE)
# Key: <attr>
#              attr Fjoldi.x  Fjoldiv.x  Hlutfall.x Fjoldi.y  Fjoldiv.y Hlutfall.y
#            <char>    <int>      <num>       <num>    <int>      <num>      <num>
# 1:  Alveg sammála       52 56.0988978 0.613343230       13 11.6621864 0.29103526
# 2: Alveg ósammála        0  0.0000000 0.000000000        1  0.8682796 0.02166832
# 3:      Hvorki né        1  0.8946978 0.009781954        5  4.5286738 0.11301515
# 4:        Sammála       34 34.4705228 0.376874816       26 23.0122461 0.57428126
# 5:       Ósammála        0  0.0000000 0.000000000        0  0.0000000 0.00000000

Using data.table 1.15.0.


sindribaldur · 2024-02-21T11:49:33Z

My mistake. The attr column is actually not sorted so that must be the issue and not a bug in merge().
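For anyone landing here with the same symptom, below is a minimal sketch of how one might check whether a key can be trusted. The helper `key_is_valid` is hypothetical (it is not part of data.table), and it assumes that base R's `order(method = "radix")`, which sorts character vectors in the C locale, agrees with data.table's ordering for the data at hand. The toy table mimics the `structure(..., sorted = "attr")` call in the report by setting the attribute without sorting.

```r
library(data.table)

# Hypothetical helper (not part of data.table): TRUE when the rows really
# are in the order that the key attribute claims.
key_is_valid = function(dt) {
  k = key(dt)
  if (is.null(k)) return(TRUE)            # no key, nothing to contradict
  cols = unname(as.list(dt)[k])           # key columns as an unnamed list
  ord = do.call(order, c(cols, list(method = "radix")))
  identical(ord, seq_len(nrow(dt)))       # already sorted => identity order
}

dt = data.table(attr = c("b", "a", "c"), n = 1:3)
setattr(dt, "sorted", "attr")   # claim a key without actually sorting
key_is_valid(dt)                # FALSE

setkey(dt, NULL)                # drop the untrustworthy key
setkey(dt, attr)                # re-key, which physically sorts the rows
key_is_valid(dt)                # TRUE
```

Dropping the key and setting it again is the same workaround as the `setkey(A, NULL)` call in the report, followed by a proper re-sort.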

ben-schwen · 2024-02-21T13:35:11Z

Looks like 2. in #5361

AFAIR the fast subsetting problem always arises when a data.table has a key attribute although it isn't really sorted according to that key.

In the example presented here, several different issues arise, and I'm not sure which one is best to fix:

1. The merge of a character and a factor column returns a "wrong" result, in the sense that the result has a key although it is not sorted by that key:
dt1 = data.table(x=c("c", "b", "a"))
dt2 = data.table(x=factor(c("a", "b", "c"), levels=c("c", "b", "a")))
setkey(dt2, x)
dt = dt2[dt1, on="x"]
2. Fast subset does not work on a keyed data.table that is not actually sorted by the key.
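The fast-subset point can be illustrated with a small sketch. This is an assumption-laden toy, not taken from #5361: the table, the forced `sorted` attribute, and the variable names are all made up for illustration. The keyed lookup relies on the rows being sorted, so on a table whose key attribute is not trustworthy it may disagree with a plain vector scan.

```r
library(data.table)

dt = data.table(x = c("b", "c", "a"), v = 1:3)
setattr(dt, "sorted", "x")         # mark as keyed by x although rows are unsorted

# Keyed lookup: joins on the key and assumes the rows are sorted by x.
keyed_hit = dt[.("a"), nomatch = NULL]

# Plain scan: compares every element and does not depend on the key.
scan_hit = dt[which(dt$x == "a")]

nrow(keyed_hit)                    # may differ from nrow(scan_hit) here
nrow(scan_hit)
```

If the two row counts differ, the key attribute is the first thing to suspect.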

setkeyv might have to check if it subsets the key, because currently the sorted attribute is not always trustworthy. This is related to "keys are wrong/don't update if column names aren't unique" (#4888) and "Inconsistent behavior in keyed/unkeyed joins against duplicate columns" (#4891).

Originally posted by @ben-schwen in #5361 (comment)

sindribaldur closed this as completed Feb 21, 2024
