tstrsplit bug when number of columns assigned in j < number of splits #3495

Closed
emilBeBri opened this issue Apr 9, 2019 · 3 comments

emilBeBri commented Apr 9, 2019

Hi,

When the result of tstrsplit() is assigned to fewer new columns than the number of pieces the split produces, everything after the last assigned column is removed instead of being kept in the last new column.

Like this:

# Minimal reproducible example

# load and create data
library(data.table)
DT <- data.table(
	string=c('this is xcutx a nice string', 'this is xcutx also a nice string xcutx whith a little problem'),
	id=1:2)
# assign to two variables; observation two contains the delimiter twice, so it splits into three pieces
DT[,(c('V1','V2')):= tstrsplit(string, 'xcutx')]

In the second observation, which has two cut points, everything after the second cut is removed. I would think this is an undesirable result for most users.

When using stri_split_regex() from the stringi package, you can set the n argument and get the expected result:

stringi::stri_split_regex(DT$string, 'xcutx', n=2)

By transposing the result, just as tstrsplit() does, the function can be used inside data.table:

DT[ ,(c('V1','V2')):= transpose(stringi::stri_split_regex(string, 'xcutx', n=2))]
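
For readers without stringi, a similar result can probably be had in base R. The sketch below is not part of the original report: it assumes exactly two target columns and that every string contains the delimiter at least once. It relies on regexpr() matching only the first occurrence, and on regmatches() with invert = TRUE returning the text on either side of that match, so any later delimiter stays inside the second piece.

```r
library(data.table)

# Same data as in the report above
DT <- data.table(
  string = c('this is xcutx a nice string',
             'this is xcutx also a nice string xcutx whith a little problem'),
  id = 1:2)

# regexpr() locates only the first 'xcutx' in each string;
# invert = TRUE returns the text before and after that single match.
first_cut <- regexpr('xcutx', DT$string, fixed = TRUE)
pieces <- regmatches(DT$string, first_cut, invert = TRUE)

# transpose() turns the per-row list into a per-column list, as tstrsplit() does.
DT[, c('V1', 'V2') := transpose(pieces)]
```

A string with no delimiter would yield a single piece here, and transpose() would pad the second column with NA for that row.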

# Output of sessionInfo()

> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.10

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_DK.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_DK.UTF-8    
 [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_DK.UTF-8   
 [7] LC_PAPER=en_DK.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringi_1.4.3     data.table_1.12.0

loaded via a namespace (and not attached):
[1] compiler_3.5.3 tools_3.5.3  
@MichaelChirico
Member

Please update data.table and try again; I get:

Error in `[.data.table`(DT, , `:=`((c("V1", "V2")), tstrsplit(string, :
  Supplied 2 columns to be assigned 3 items. Please see NEWS for v1.12.2.

This is the correct behavior.

Agreed it can be cumbersome to anticipate the output of tstrsplit (especially on bigger data sets); see #1543 for a related issue.
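
One way to anticipate the output, sketched here as an illustration rather than as the maintainers' recommendation, is to count the pieces before assigning. If dropping the extra pieces is really what is wanted, tstrsplit()'s keep argument makes that choice explicit. The column names and data are taken from the report above.

```r
library(data.table)

DT <- data.table(
  string = c('this is xcutx a nice string',
             'this is xcutx also a nice string xcutx whith a little problem'),
  id = 1:2)

# How many pieces would each row produce? The maximum is the number of
# columns tstrsplit() returns.
n_pieces <- lengths(strsplit(DT$string, 'xcutx', fixed = TRUE))
max(n_pieces)

# Explicitly keep only the first two pieces; the third piece of row 2 is
# discarded on purpose rather than silently.
DT[, c('V1', 'V2') := tstrsplit(string, 'xcutx', fixed = TRUE, keep = 1:2)]
```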

@emilBeBri
Author

Well I'll be damned, sorry for posting a redundant bug. I thought I had been careful in checking whether it had been filed or fixed.

@MichaelChirico
Member

@emilBeBri don't worry about it, searching can be hard. Thank you for taking the time to report with an MRE 👍
