tstrsplit bug when number of columns assigned in j < number of splits #3495

emilBeBri · 2019-04-09T12:16:31Z

Hi,

When the result of tstrsplit() is assigned to fewer new columns than the number of pieces the split produces, everything after the last assigned column is silently dropped instead of being kept in the last column.

Like this:

# Minimal reproducible example

# load and create data
library(data.table)
DT <- data.table(
	string=c('this is xcutx a nice string', 'this is xcutx also a nice string xcutx whith a little problem'),
	id=1:2)
# assign to two variables; observation two has two cutpoints (three pieces) with the chosen delimiter
DT[,(c('V1','V2')):= tstrsplit(string, 'xcutx')]

In the second observation, because it has 2 cutpoints, everything after the second cut is removed. I would think this is an undesirable result for most users.
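A quick way to see where the extra piece comes from is to count the pieces strsplit() returns for each row, since tstrsplit() transposes that result. A minimal sketch in base R (the names `x` and `pieces` are only for illustration):

```r
x <- c('this is xcutx a nice string',
       'this is xcutx also a nice string xcutx whith a little problem')

# Number of pieces per row; row two has two delimiters, hence three pieces
pieces <- lengths(strsplit(x, 'xcutx', fixed = TRUE))
print(pieces)
```

The second row yields three pieces, one more than the two columns on the left-hand side of `:=`.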

When using stri_split_regex() from the stringi package, you can set the n argument and get the expected result:

stringi::stri_split_regex(DT$string, 'xcutx', n=2)

By transposing the result, just as tstrsplit() does, the function can be used inside data.table:

DT[ ,(c('V1','V2')):= transpose(stringi::stri_split_regex(string, 'xcutx', n=2))]

# Output of sessionInfo()

> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.10

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_DK.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_DK.UTF-8    
 [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_DK.UTF-8   
 [7] LC_PAPER=en_DK.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringi_1.4.3     data.table_1.12.0

loaded via a namespace (and not attached):
[1] compiler_3.5.3 tools_3.5.3

MichaelChirico · 2019-04-09T16:13:52Z

Please update data.table and try again; I get:

Error in `[.data.table`(DT, , `:=`((c("V1", "V2")), tstrsplit(string, :
  Supplied 2 columns to be assigned 3 items. Please see NEWS for v1.12.2.

This is the correct behavior.

Agreed it can be cumbersome to anticipate the output of tstrsplit (especially on bigger data sets); see #1543 for a related issue.
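For anyone landing here on data.table 1.12.2 or later, here is a sketch of two ways to make the two-column assignment work. The first uses the `keep` argument of tstrsplit() to state explicitly that only the first two pieces are wanted, so the rest is still dropped, but by choice. The second is a base-R alternative to the stringi call from the report, assuming that splitting at the first delimiter only is what you want: regexpr() finds the first match and regmatches(..., invert = TRUE) returns the text on either side of it. The column names W1 and W2 are made up for the example.

```r
library(data.table)

DT <- data.table(
  string = c('this is xcutx a nice string',
             'this is xcutx also a nice string xcutx whith a little problem'),
  id = 1:2)

# Option 1: keep only the first two pieces; later pieces are dropped on purpose
DT[, c('V1', 'V2') := tstrsplit(string, 'xcutx', fixed = TRUE, keep = 1:2)]

# Option 2: split at the first delimiter only, so the remainder stays in W2
DT[, c('W1', 'W2') := transpose(
  regmatches(string, regexpr('xcutx', string, fixed = TRUE), invert = TRUE))]

print(DT[, .(V2, W2)])
```

With option 2, a row that contains no delimiter at all yields a single piece, and transpose() pads the second column with NA for that row.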

emilBeBri · 2019-04-10T13:19:16Z

Well, I'll be damned; sorry for posting a redundant bug. I thought I had been careful in checking whether it had already been filed or fixed.

MichaelChirico · 2019-04-13T06:16:02Z

@emilBeBri don't worry about it, searching can be hard. Thank you for taking the time to report with an MRE 👍

MichaelChirico closed this as completed Apr 9, 2019
