Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

which = NA yields incorrect results #4303

Open
cbilot opened this issue Mar 14, 2020 · 4 comments · May be fixed by #4342
Open

which = NA yields incorrect results #4303

cbilot opened this issue Mar 14, 2020 · 4 comments · May be fixed by #4342
Labels
bug joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins

Comments

@cbilot
Copy link

cbilot commented Mar 14, 2020

While working with the "which = (FALSE/TRUE/NA)" option of a data.table join, I received some quite unexpected results.

Let's start with a simple case that works correctly. Intuitively, we are asking the question "which orders do not have a valid customer ID"?

orders <-  
  data.table(  
    ID = c(105257553L)  
  )  
    
customers <-    
  data.table(  
    ID = c(  
      108924851L, 105257553L, 118054200L, 108365953L,  
      116642294L, 100419961L, 115677488L, 100405475L,  
      119246064L, 100383251L  
    )  
  )                  
    
customers[orders,  
  on = .(ID),  
  which = NA  
]  

This yields the (correct) output:
integer(0)
because ID: 105257553 (the only row in the orders table) is the second ID in the customers table.

FWIW, with datatable.verbose = TRUE, we get the following output:

i.ID has same type (integer) as x.ID. No coercion needed.
forder.c received 10 rows and 1 columns
Calculated ad hoc index in 0.000s elapsed (0.000s cpu) 
Starting bmerge ...
forder.c received 1 rows and 1 columns
bmerge done in 0.001s elapsed (0.000s cpu) 
forder.c received a vector type 'integer' length 0
integer(0)

Now, let's make one tiny change. Let's change only the last digit of the ID in the orders table from 105257553 to 105257554. Note: this ID is not in the customers table.

orders <-  
  data.table(  
    ID = c(105257554L)  
  )  
    
customers <-    
  data.table(  
    ID = c(  
      108924851L, 105257553L, 118054200L, 108365953L,  
      116642294L, 100419961L, 115677488L, 100405475L,  
      119246064L, 100383251L  
    )  
  )                  
    
customers[orders, on = .(ID), which = NA]   

Now, we get following output:
[1] 1 2 3 4 5 6 7 8 9 10
This result cannot be correct. For one, the orders table has only one row, so no element of the result should be anything other than 1. Also notice that we receive 10 values in our result -- even though our orders table only has one row.

But things get even stranger. Let's keep ID 105257554 (which is not in the customers table), but add a second ID to the orders table. This time let's add ID: 108924851, which is in the customers table:

orders <-  
  data.table(  
    ID = c(105257554L, 108924851L)
  )  
    
customers <-    
  data.table(  
    ID = c(  
      108924851L, 105257553L, 118054200L, 108365953L,  
      116642294L, 100419961L, 115677488L, 100405475L,  
      119246064L, 100383251L  
    )  
  )                  
    
customers[orders, on = .(ID), which = NA]   

Now we get the following output:
[1] 3 4 6 7 10
Certainly not the expected result. But perhaps more strangely .. the values 5, 8, and 9 disappear, compared to the prior result.

FWIW, I discovered this while working on some largish datasets (i of size ~200 million records, x of size ~25 million records). I received results that often contained a large number of "NA" values, as well as result vectors of strange sizes.

Oddly, when I tried to construct simplified examples for this issue ticket using made-up ID's, all of them worked correctly. The above examples I found by using real data.

As a mitigation for now, I simply revert to using code like:

orders[, missing_cust_ID := FALSE]
orders[!customers, missing_cust_ID := TRUE, on = .(ID)] 
print(orders)
           ID missing_cust_ID
 1: 105257554            TRUE
 2: 108924851           FALSE

# Output of sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] nvimcom_0.9-86       rvisidata_0.0.0.9000 colorout_1.2-2       data.table_1.12.8   

loaded via a namespace (and not attached):
 [1] compiler_3.6.3  readr_1.3.1     R6_2.4.1        magrittr_1.5    hms_0.5.3       tools_3.6.3     pillar_1.4.3    tibble_2.1.3    Rcpp_1.0.3     
[10] crayon_1.3.4    vctrs_0.2.4     pkgconfig_2.0.3 rlang_0.4.5    
@jangorecki
Copy link
Member

Thnank you @cbilot for reporting. I am able to reproduce the problem.

jangorecki added a commit that referenced this issue Apr 2, 2020
@jangorecki jangorecki linked a pull request Apr 2, 2020 that will close this issue
@jangorecki
Copy link
Member

@cbilot are you able to try branch from #4342 to see if it resolves all your use cases?

@jangorecki jangorecki added bug joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins labels Apr 2, 2020
@cbilot
Copy link
Author

cbilot commented Apr 2, 2020

@jangorecki , #4342 does indeed resolve all my use cases.

@cbilot cbilot closed this as completed Apr 2, 2020
@MichaelChirico MichaelChirico reopened this Apr 3, 2020
@MichaelChirico
Copy link
Member

MichaelChirico commented Apr 3, 2020

No need to close the issue, it will happen automatically once the PR reference above is included to the master branch.

Thanks for the report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants