merge diagnostics argument/option #5298

Closed
grantmcdermott opened this issue Dec 28, 2021 · 2 comments

Comments

@grantmcdermott
Contributor

grantmcdermott commented Dec 28, 2021

Prompted by this tweet by @johnjosephhorton (who, I hope, doesn't mind being tagged):

An R function I'd find useful: diagnostics of tables joins---column overlap to start, multiple matches, # of non-matches, missing keys & how they were handled, maybe a venn diagram of column names, etc. Something like this already exist? Joining is a high risk data cleaning task

I'd advocate for a new diagnostics logical argument (whose default behaviour could fall back to an option), since verbose already returns quite a lot of unrelated information.
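To make the "argument that falls back to an option" idea concrete, here is a minimal sketch in base R. The wrapper name `merge_with_diagnostics` and the option name `mypkg.merge.diagnostics` are made up for illustration; neither exists in data.table, and the message text is only a placeholder for whatever diagnostics would actually be reported.

```r
# Hypothetical wrapper: `diagnostics` defaults to a (made-up) global option,
# so users can switch reporting on once per session or per call.
merge_with_diagnostics = function(x, y, ...,
                                  diagnostics = getOption("mypkg.merge.diagnostics", FALSE)) {
  out = merge(x, y, ...)
  if (isTRUE(diagnostics)) {
    # Placeholder report: row counts going in and coming out of the merge
    message(sprintf("merge: %d rows in x, %d rows in y, %d rows in result",
                    nrow(x), nrow(y), nrow(out)))
  }
  out
}

x = data.frame(A = 1:5, B = 6:10)
y = data.frame(A = c(1, 4), C = LETTERS[c(1, 4)])

# Per-call opt-in; options(mypkg.merge.diagnostics = TRUE) would do the same globally
res = merge_with_diagnostics(x, y, by = "A", all.x = TRUE, diagnostics = TRUE)
```

Keeping the switch separate from `verbose` means the join report can be read without wading through unrelated output.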

Worth noting that the tidylog package largely provides this behaviour for dplyr joins.

library(dplyr, warn.conflicts = FALSE)
library(tidylog, warn.conflicts = FALSE)

x = data.frame(A = 1:5, B = 6:10)
y = data.frame(A = c(1, 4), C = LETTERS[c(1, 4)])

left_join(x, y)
#> Joining, by = "A"
#> left_join: added one column (C)
#>            > rows only in x   3
#>            > rows only in y  (0)
#>            > matched rows     2
#>            >                 ===
#>            > rows total       5
#>   A  B    C
#> 1 1  6    A
#> 2 2  7 <NA>
#> 3 3  8 <NA>
#> 4 4  9    D
#> 5 5 10 <NA>

Created on 2021-12-27 by the reprex package (v2.0.1)

As per John's tweet, additional data.table-specific information like keys would be valuable here too. This wouldn't resolve issues like #4888 and #4891, but it would go quite far towards heading them off.
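As a rough illustration of what such diagnostics could report for data.table inputs, here is a sketch of a standalone helper. `join_diagnostics` and its field names are hypothetical and not part of data.table; it only relies on existing features (anti-joins via `x[!y, on = ]`, `key()`, `anyDuplicated()` with `by`).

```r
library(data.table)

# Hypothetical helper, not part of data.table: summarise how two tables
# would join on `cols` (default: the overlapping column names).
join_diagnostics = function(x, y, cols = intersect(names(x), names(y))) {
  stopifnot(is.data.table(x), is.data.table(y), length(cols) > 0L)
  list(
    join_cols   = cols,
    key_x       = key(x),                        # NULL if x has no key set
    key_y       = key(y),
    rows_only_x = nrow(x[!y, on = cols]),        # anti-join: x rows with no match in y
    rows_only_y = nrow(y[!x, on = cols]),        # anti-join: y rows with no match in x
    matched     = nrow(x[y, on = cols, nomatch = NULL,
                         allow.cartesian = TRUE]), # inner-join row count
    dup_keys_x  = anyDuplicated(x, by = cols) > 0L,
    dup_keys_y  = anyDuplicated(y, by = cols) > 0L,
    na_keys_x   = sum(!complete.cases(x[, .SD, .SDcols = cols])),
    na_keys_y   = sum(!complete.cases(y[, .SD, .SDcols = cols]))
  )
}

x = data.table(A = 1:5, B = 6:10)
y = data.table(A = c(1L, 4L), C = LETTERS[c(1, 4)])

d = join_diagnostics(x, y)
str(d)
```

A built-in version would presumably compute these counts as a by-product of the join itself rather than running separate anti-joins as this sketch does.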

Session Info

sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.18.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tidylog_1.0.2 dplyr_1.0.7  
#> 
#> loaded via a namespace (and not attached):
#>  [1] pillar_1.6.4      compiler_4.1.2    highr_0.9         R.methodsS3_1.8.1
#>  [5] R.utils_2.11.0    tools_4.1.2       digest_0.6.29     evaluate_0.14    
#>  [9] lifecycle_1.0.1   tibble_3.1.6      R.cache_0.15.0    pkgconfig_2.0.3  
#> [13] rlang_0.4.12      reprex_2.0.1      DBI_1.1.1         yaml_2.2.1       
#> [17] xfun_0.28         fastmap_1.1.0     withr_2.4.3       styler_1.6.2     
#> [21] stringr_1.4.0     knitr_1.36        generics_0.1.1    fs_1.5.2         
#> [25] vctrs_0.3.8       tidyselect_1.1.1  glue_1.5.1        R6_2.5.1         
#> [29] fansi_0.5.0       rmarkdown_2.11    tidyr_1.1.4       purrr_0.3.4      
#> [33] magrittr_2.0.1    clisymbols_1.2.0  backports_1.4.1   ellipsis_0.3.2   
#> [37] htmltools_0.5.2   assertthat_0.2.1  utf8_1.2.2        stringi_1.7.6    
#> [41] crayon_1.4.2      R.oo_1.24.0


@ben-schwen
Member

ben-schwen commented Dec 28, 2021

Related to, and potentially a duplicate of, #4677

@grantmcdermott
Contributor Author

Ah, I don't know how I missed that. Agreed that it's a dup, so I'll close this and add a comment there.
