Merge returning wrong result when key is set. #5945

Closed
sindribaldur opened this issue Feb 21, 2024 · 2 comments

@sindribaldur

I suspect this has something to do with encoding, but I failed to reproduce it with a more minimal example. We do have identical(A$attr, B$attr), but data.table disagrees when the key is set.

A = structure(list(attr = c("Alveg ósammála", "Ósammála", "Hvorki né", 
"Sammála", "Alveg sammála"), Fjoldi = c(0L, 0L, 1L, 34L, 52L
), Fjoldiv = c(0, 0, 0.89469781238413, 34.4705228031146, 56.0988977651936
), Hlutfall = c(0, 0, 0.00978195415015335, 0.376874816194491, 
0.613343229655356)), row.names = c(NA, -5L), class = c("data.table", 
"data.frame"), sorted = "attr")
B = structure(list(attr = c("Alveg ósammála", "Ósammála", "Hvorki né", 
"Sammála", "Alveg sammála"), Fjoldi = c(1L, 0L, 5L, 26L, 13L
), Fjoldiv = c(0.868279569892473, 0, 4.52867383512545, 23.0122461170848, 
11.6621863799283), Hlutfall = c(0.021668318934995, 0, 0.113015153660955, 
0.574281263277156, 0.291035264126894)), row.names = c(NA, -5L
), class = c("data.table", "data.frame"))


merge(A, B, by = "attr", all = TRUE)
# Key: <attr>
#              attr Fjoldi.x  Fjoldiv.x  Hlutfall.x Fjoldi.y  Fjoldiv.y Hlutfall.y
#            <char>    <int>      <num>       <num>    <int>      <num>      <num>
# 1:  Alveg sammála       52 56.0988978 0.613343230       NA         NA         NA
# 2:  Alveg sammála       NA         NA          NA       13 11.6621864 0.29103526
# 3: Alveg ósammála        0  0.0000000 0.000000000        1  0.8682796 0.02166832
# 4:      Hvorki né        1  0.8946978 0.009781954        5  4.5286738 0.11301515
# 5:        Sammála       34 34.4705228 0.376874816       26 23.0122461 0.57428126
# 6:       Ósammála        0  0.0000000 0.000000000       NA         NA         NA
# 7:       Ósammála       NA         NA          NA        0  0.0000000 0.00000000
setkey(A, NULL)
merge(A, B, by = "attr", all = TRUE)
# Key: <attr>
#              attr Fjoldi.x  Fjoldiv.x  Hlutfall.x Fjoldi.y  Fjoldiv.y Hlutfall.y
#            <char>    <int>      <num>       <num>    <int>      <num>      <num>
# 1:  Alveg sammála       52 56.0988978 0.613343230       13 11.6621864 0.29103526
# 2: Alveg ósammála        0  0.0000000 0.000000000        1  0.8682796 0.02166832
# 3:      Hvorki né        1  0.8946978 0.009781954        5  4.5286738 0.11301515
# 4:        Sammála       34 34.4705228 0.376874816       26 23.0122461 0.57428126
# 5:       Ósammála        0  0.0000000 0.000000000        0  0.0000000 0.00000000

Using data.table 1.15.0.

@sindribaldur
Author

My mistake. The attr column is actually not sorted so that must be the issue and not a bug in merge().
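
A minimal sketch of how such a stale key could be detected and repaired. `key_is_valid()` is a hypothetical helper written for this illustration, not part of data.table, and the three-row table is made up. The `setattr()` call only mimics what restoring a `dput()`/`structure()` dump does: it records a key without sorting anything.

```r
library(data.table)

# Hypothetical helper (not a data.table function): TRUE when the rows really
# are in the order that the "sorted" attribute claims.
key_is_valid <- function(DT) {
  k <- key(DT)
  if (is.null(k)) return(TRUE)      # no key, nothing to verify
  resorted <- copy(DT)
  setkey(resorted, NULL)            # drop the claimed key without reordering
  setkeyv(resorted, k)              # let data.table sort the copy for real
  # Key columns must match the genuinely sorted copy, column by column
  all(vapply(k, function(col) identical(DT[[col]], resorted[[col]]), logical(1)))
}

# Made-up example: mark an unsorted column as the key without sorting it
DT <- data.table(x = c("b", "c", "a"), v = 1:3)
setattr(DT, "sorted", "x")
key_is_valid(DT)   # FALSE: the key attribute is set but the rows are unsorted

# Repair: drop the stale key first, then set it again so the table is sorted
setkey(DT, NULL)
setkey(DT, x)
key_is_valid(DT)   # TRUE
```

Dropping the key with `setkey(DT, NULL)` before setting it again is a deliberately cautious choice, so the repair does not depend on how `setkey()` treats a table that already claims to be keyed.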

@ben-schwen
Member

Looks like 2. in #5361

AFAIR the fast subsetting problem always arises when a data.table has a key attribute although it isn't actually sorted according to that key.

In the example presented here, multiple different issues arise, and I'm not sure which is the best one to fix:

  1. the merge of a character and a factor column returns a "wrong" result in the sense that the result has a key although it is not sorted by the key
dt1 = data.table(x=c("c", "b", "a"))
dt2 = data.table(x=factor(c("a", "b", "c"), levels=c("c", "b", "a")))
setkey(dt2, x)
dt = dt2[dt1, on="x"]
  2. fast subset does not work on a keyed data.table that is not actually sorted by the key
  3. setkeyv might have to check whether it subsets the key, because currently the sorted attribute is not always trustworthy. This is related to "keys are wrong/don't update if column names aren't unique" #4888 and "Inconsistent behavior in keyed/unkeyed joins against duplicate columns" #4891

Originally posted by @ben-schwen in #5361 (comment)
