Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join with index delivers unexpected results if indexed column name is a prefix of the join column name #2465

Closed
pannnda opened this issue Nov 6, 2017 · 3 comments · Fixed by #2476
Labels
bug joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins
Milestone

Comments

@pannnda
Copy link

pannnda commented Nov 6, 2017

This issue is based on my question on stackoverflow.

For a specific set-up of two data.tables a join does not deliver the results I am expecting.

library(data.table)

# In the code below the join does not deliver the result I would expect
DT1 <- data.table(colname=c("test1","test2","test2","test3"), colname_with_suffix=c("other","test","includes test within","other"))
DT2 <- data.table(lookup=c("test1","test2","test3"), lookup_result=c(1,2,3))
DT1[colname_with_suffix == "not found", ]  # automatically creates index on colname_with_suffix
DT1[DT2, lookup_result := i.lookup_result, on=c("colname"="lookup")][]
# PLEASE NOTE: same result with slightly different syntax: DT1[DT2, lookup_result := i.lookup_result, on=c(colname="lookup")][]
# colname  colname_with_suffix lookup_result
# 1:   test1                other         NA
# 2:   test2                 test         NA
# 3:   test2 includes test within         NA
# 4:   test3                other          3


# Expected result:
 # colname  colname_with_suffix lookup_result
# 1:   test1                other          1
# 2:   test2                 test          2
# 3:   test2 includes test within          2
# 4:   test3                other          3    

For the following variations the join works as expected. The unexpected behaviour above seems to be occurring only, if an index exists on a column having a column name being the prefix of the join column name and both having similar text contents.

# For all following alternatives the join delivers the correct result

# (a) Same data tables as above, but no index
DT1 <- data.table(colname=c("test1","test2","test2","test3"), colname_with_suffix=c("other","test","includes test within","other"))
DT2 <- data.table(lookup=c("test1","test2","test3"), lookup_result=c(1,2,3))
DT1[DT2, lookup_result := i.lookup_result, on=c("colname"="lookup")][]

# (b) Index on DT2, but completely different values in indexed column than in join column
DT1 <- data.table(colname=c("test1","test2","test2","test3"), colname_with_suffix=c("other","other","other","other"))
DT2 <- data.table(lookup=c("test1","test2","test3"), lookup_result=c(1,2,3))
DT1[colname_with_suffix == "not found", ]  # automatically creates index on colname_with_suffix
DT1[DT2, lookup_result := i.lookup_result, on=c("colname"="lookup")][]

# (c) Index on DT2, similar values in indexed column, but indexed column name is not a prefix of join column name
DT1 <- data.table(colname=c("test1","test2","test2","test3"), x.colname_with_suffix=c("other","test","includes test within","other"))
DT2 <- data.table(lookup=c("test1","test2","test3"), lookup_result=c(1,2,3))
DT1[x.colname_with_suffix == "not found", ]  # automatically creates index on x.colname_with_suffix
DT1[DT2, lookup_result := i.lookup_result, on=c("colname"="lookup")][]

SessionInfo:

# R version 3.3.2 (2016-10-31)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
# 
# locale:
#     [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                    LC_TIME=German_Germany.1252    
# 
# attached base packages:
#     [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
#     [1] data.table_1.10.0
# 
# loaded via a namespace (and not attached):
#     [1] tools_3.3.2

Please note that the same behavior occurs for data.table 1.10.4 and R.Version 3.4.2 under Windows and also Ubuntu Linux 14.04.

@MarkusBonsch
Copy link
Contributor

This seems to be a bug that deserves fixing. I will dig into it and try to fix as my time allows.

@st-pasha st-pasha added bug joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins labels Nov 10, 2017
MarkusBonsch pushed a commit to MarkusBonsch/data.table that referenced this issue Nov 11, 2017
@MarkusBonsch
Copy link
Contributor

This was an easy fix (see pull request). Thanks for reporting the bug and hope that the issue is solved now.

@aryoda
Copy link

aryoda commented Nov 12, 2017

@MarkusBonsch I have just applied your commit to the most-recent dev version of data.table and tested it with the two examples from the linked SO question.

Both examples do work now as expected!

Many THX for the quick fix.

@mattdowle mattdowle added this to the v1.10.6 milestone Nov 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants