Weird issue related to index and non-ASCII character #1826

Closed
shrektan opened this issue Aug 25, 2016 · 20 comments
Labels
encoding issues related to Encoding
Milestone

Comments

@shrektan
Member

Hi, I want to report an issue with non-ASCII characters when joining on an index or key. It's complicated to explain in words. Luckily, I have a reproducible example, shown below (it took me 3 hours to find T.T):

Under current dev version (1.9.7) of data.table

library(data.table)
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
               stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
##             PL_Type HS_Port_Code
## 1: 公允价值变动损益           NA
unique(Encoding(dt$PL_Type))
## [1] "unknown"
dt[, PL_Type := enc2utf8(PL_Type)]
setkey(dt, PL_Type)
unique(Encoding(dt$PL_Type))
## [1] "UTF-8"
dt[J("公允价值变动损益")]
##              PL_Type HS_Port_Code
##  1: 公允价值变动损益         2042
##  2: 公允价值变动损益         2013
##  3: 公允价值变动损益         2032
##  4: 公允价值变动损益         2052
##  5: 公允价值变动损益         2035
##  6: 公允价值变动损益         2022
##  7: 公允价值变动损益         2015
##  8: 公允价值变动损益         2025
##  9: 公允价值变动损益         2023
## 10: 公允价值变动损益         2012
## 11: 公允价值变动损益         2055
## 12: 公允价值变动损益         8212
## 13: 公允价值变动损益         8222
## 14: 公允价值变动损益         2045
devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                                              
##  version  R version 3.3.1 (2016-06-21)                       
##  system   i386, mingw32                                      
##  ui       RTerm                                              
##  language (EN)                                               
##  collate  Chinese (Simplified)_People's Republic of China.936
##  tz       Asia/Taipei                                        
##  date     2016-08-25
## Packages ------------------------------------------------------------------
##  package    * version date       source        
##  data.table * 1.9.7   2016-08-25 local         
##  devtools     1.12.0  2016-06-24 CRAN (R 3.3.1)
##  digest       0.6.10  2016-08-02 CRAN (R 3.3.1)
##  evaluate     0.9     2016-04-29 CRAN (R 3.2.5)
##  htmltools    0.3.5   2016-03-21 CRAN (R 3.2.4)
##  knitr        1.14    2016-08-13 CRAN (R 3.3.1)
##  magrittr     1.5     2014-11-22 CRAN (R 3.1.2)
##  memoise      1.0.0   2016-01-29 CRAN (R 3.2.3)
##  Rcpp         0.12.4  2016-03-26 CRAN (R 3.2.4)
##  rmarkdown    1.0     2016-07-08 CRAN (R 3.3.1)
##  stringi      1.1.1   2016-05-27 CRAN (R 3.2.5)
##  stringr      1.1.0   2016-08-19 CRAN (R 3.3.1)
##  withr        1.0.2   2016-06-20 CRAN (R 3.2.5)

Under CRAN version (1.9.6) of data.table

library(data.table)
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
               stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
##              PL_Type HS_Port_Code
##  1: 公允价值变动损益         2042
##  2: 公允价值变动损益         2013
##  3: 公允价值变动损益         2032
##  4: 公允价值变动损益         2052
##  5: 公允价值变动损益         2035
##  6: 公允价值变动损益         2022
##  7: 公允价值变动损益         2015
##  8: 公允价值变动损益         2025
##  9: 公允价值变动损益         2023
## 10: 公允价值变动损益         2012
## 11: 公允价值变动损益         2055
## 12: 公允价值变动损益         8212
## 13: 公允价值变动损益         8222
## 14: 公允价值变动损益         2045
unique(Encoding(dt$PL_Type))
## [1] "unknown"
dt[, PL_Type := enc2utf8(PL_Type)]
setkey(dt, PL_Type)
unique(Encoding(dt$PL_Type))
## [1] "UTF-8"
dt[J("公允价值变动损益")]
## Warning in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends,
## nomatch, : A known encoding (latin1 or UTF-8) was detected in a join
## column. data.table compares the bytes currently, so doesn't support *mixed*
## encodings well; i.e., using both latin1 and UTF-8, or if any unknown
## encodings are non-ascii and some of those are marked known and others
## not. But if either latin1 or UTF-8 is used exclusively, and all unknown
## encodings are ascii, then the result should be ok. In future we will check
## for you and avoid this warning if everything is ok. The tricky part is
## doing this without impacting performance for ascii-only cases.
##             PL_Type HS_Port_Code
## 1: 公允价值变动损益           NA
devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                                              
##  version  R version 3.3.1 (2016-06-21)                       
##  system   i386, mingw32                                      
##  ui       RTerm                                              
##  language (EN)                                               
##  collate  Chinese (Simplified)_People's Republic of China.936
##  tz       Asia/Taipei                                        
##  date     2016-08-25
## Packages ------------------------------------------------------------------
##  package    * version date       source        
##  chron        2.3-47  2015-06-24 CRAN (R 3.2.1)
##  data.table * 1.9.6   2015-09-19 CRAN (R 3.3.1)
##  devtools     1.12.0  2016-06-24 CRAN (R 3.3.1)
##  digest       0.6.10  2016-08-02 CRAN (R 3.3.1)
##  evaluate     0.9     2016-04-29 CRAN (R 3.2.5)
##  htmltools    0.3.5   2016-03-21 CRAN (R 3.2.4)
##  knitr        1.14    2016-08-13 CRAN (R 3.3.1)
##  magrittr     1.5     2014-11-22 CRAN (R 3.1.2)
##  memoise      1.0.0   2016-01-29 CRAN (R 3.2.3)
##  Rcpp         0.12.6  2016-07-19 CRAN (R 3.3.1)
##  rmarkdown    1.0     2016-07-08 CRAN (R 3.3.1)
##  stringi      1.1.1   2016-05-27 CRAN (R 3.2.5)
##  stringr      1.1.0   2016-08-19 CRAN (R 3.3.1)
##  withr        1.0.2   2016-06-20 CRAN (R 3.2.5)

Note

As you can see, the behavior differs between the two versions of data.table. I can't reproduce the example without the csv file, and I'm not sure whether it only occurs when the data is read from a csv file or from a database... In my real cases it goes like this: at first the join works, but when I set the encoding to native, it stops working. Then I set it to UTF-8 and it works. Then I set it to native again, and it works~

I strongly suspect the issue was introduced by a commit within the last 3 months, because I update the dev version of data.table fairly regularly.

BTW, I installed the dev version of data.table following the instructions at https://github.com/Rdatatable/data.table/wiki/Installation:

remove.packages("data.table")                         # First remove the current version
install.packages("data.table", type = "source",
    repos = "http://Rdatatable.github.io/data.table") # Then install devel version
@MichaelChirico
Member

MichaelChirico commented Aug 25, 2016

One shortcoming seems to be that the encoding parameter of fread is limited: it cannot accept "GB2312" at the moment.

However, the following seems to work:

DT <- fread("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
            encoding = "unknown")

DT[ , PL_Type := iconv(PL_Type, "GB2312", "UTF-8")]
setkey(DT, PL_Type)
DT[J("公允价值变动损益")]
#              PL_Type HS_Port_Code
#  1: 公允价值变动损益         2042
#  2: 公允价值变动损益         2013
#  3: 公允价值变动损益         2032
#  4: 公允价值变动损益         2052
#  5: 公允价值变动损益         2035
#  6: 公允价值变动损益         2022
#  7: 公允价值变动损益         2015
#  8: 公允价值变动损益         2025
#  9: 公允价值变动损益         2023
# 10: 公允价值变动损益         2012
# 11: 公允价值变动损益         2055
# 12: 公允价值变动损益         8212
# 13: 公允价值变动损益         8222
# 14: 公允价值变动损益         2045

The iconv part doesn't seem too expensive -- perhaps we can just have fread do this iconv step under the hood if the encoding is something atypical?

DT <- fread("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
            encoding = "unknown")

DTN <- rbindlist(lapply(integer(1e5), function(...) DT))

system.time(DTN[ , PL_Type := iconv(PL_Type, "GB2312", "UTF-8")])
#    user  system elapsed 
#   3.056   0.000   3.055 

In fact, read.csv's handling of fileEncoding appears tightly related to iconv; from ?read.table:

The encoding of the input/output stream of a connection can be specified by name in the same way as it would be given to iconv

iconvlist() will also be helpful for flagging inappropriate inputs.


Barring that implementation, a note in ?fread regarding strange encodings and the utility of iconv could suffice.
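The iconv step suggested above could be wrapped roughly as follows. This is only a sketch: `to_utf8` is a hypothetical helper name, not part of data.table or base R, and it assumes `iconvlist()` is available and populated on the platform (it may not be on every system). It rejects encoding names that iconv does not list, then converts a character vector to UTF-8.

```r
# Hypothetical helper (not part of data.table): validate the source encoding
# against iconvlist(), then convert to UTF-8 with iconv().
to_utf8 <- function(x, from) {
  known <- toupper(iconvlist())          # encoding names iconv reports as supported
  if (!toupper(from) %in% known) {
    stop("unknown encoding: ", from, call. = FALSE)
  }
  iconv(x, from = from, to = "UTF-8")    # non-ASCII results are marked as UTF-8
}

# A single latin1 byte (e with acute accent), built from raw so the example
# does not depend on the encoding of this script.
latin1_bytes <- rawToChar(as.raw(0xe9))
out <- to_utf8(latin1_bytes, "ISO-8859-1")
Encoding(out)
```

In the thread's case the call would be something like `DT[, PL_Type := to_utf8(PL_Type, "GB2312")]` after reading with `encoding = "unknown"`.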

@shrektan
Member Author

@MichaelChirico Sorry, I don't understand how this issue is related to fread... BTW, as I mentioned above, the same issue happens when fetching from a database...

@MichaelChirico
Member

Because ideally, the encoding issue would be handled immediately upon incorporating the data into R.

To me, an ideal workflow for this would be:

DT <- fread("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
            encoding = "GB2312", key = "PL_Type")
DT[.("公允价值变动损益")]

@shrektan
Member Author

shrektan commented Aug 26, 2016

@MichaelChirico Yes, for csv files fread lacks support for arbitrary encodings, unlike read.csv.

However, I don't think this issue itself is directly related to fread, because it happens not only with csv files but also with data read from a database, which fread cannot handle... Also, the case itself is rather complicated, as I described above. It seems something went wrong when support for indexing character columns in different encodings was implemented, as in 03cd45f .

But I don't think commit 03cd45f itself causes this, since it was committed in January. So my guess is that it's related to some internal encoding changes made after that.

@arunsrinivasan Please take a look at this issue... Thanks.

@arunsrinivasan
Member

arunsrinivasan commented Aug 26, 2016

On OS X 10.11.6, Ubuntu 14 and 16, I get this:

> library(data.table)
data.table 1.9.7 IN DEVELOPMENT built 2016-08-26 16:52:35 UTC
For help type ?data.table or https://github.com/Rdatatable/data.table/wiki
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
> dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
+                stringsAsFactors = FALSE, fileEncoding = "GB2312")
> setDT(dt)
> setkey(dt, PL_Type)
> dt[J("公允价值变动损益")]
             PL_Type HS_Port_Code
 1: 公允价值变动损益         2042
 2: 公允价值变动损益         2013
 3: 公允价值变动损益         2032
 4: 公允价值变动损益         2052
 5: 公允价值变动损益         2035
 6: 公允价值变动损益         2022
 7: 公允价值变动损益         2015
 8: 公允价值变动损益         2025
 9: 公允价值变动损益         2023
10: 公允价值变动损益         2012
11: 公允价值变动损益         2055
12: 公允价值变动损益         8212
13: 公允价值变动损益         8222
14: 公允价值变动损益         2045

Could you please replace your session_info() with the normal sessionInfo() output, which gives nicer platform, "Running under" and locale info?

@MichaelChirico
Member

Hmm that's odd. I swear when this was posted I was getting the same thing as OP on Linux Mint (over Ubuntu 14.04)... don't think I've touched my install since then (startup says built 2016-08-20 14:14:50 UTC)...

@shrektan
Member Author

@arunsrinivasan Sorry, I didn't make it clear. I tested it under win7. Below is the new test code using sessionInfo

Under current dev version (1.9.7) of data.table

library(data.table)
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
               stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
##             PL_Type HS_Port_Code
## 1: 公允价值变动损益           NA
unique(Encoding(dt$PL_Type))
## [1] "unknown"
dt[, PL_Type := enc2utf8(PL_Type)]
setkey(dt, PL_Type)
unique(Encoding(dt$PL_Type))
## [1] "UTF-8"
dt[J("公允价值变动损益")]
##              PL_Type HS_Port_Code
##  1: 公允价值变动损益         2042
##  2: 公允价值变动损益         2013
##  3: 公允价值变动损益         2032
##  4: 公允价值变动损益         2052
##  5: 公允价值变动损益         2035
##  6: 公允价值变动损益         2022
##  7: 公允价值变动损益         2015
##  8: 公允价值变动损益         2025
##  9: 公允价值变动损益         2023
## 10: 公允价值变动损益         2012
## 11: 公允价值变动损益         2055
## 12: 公允价值变动损益         8212
## 13: 公允价值变动损益         8222
## 14: 公允价值变动损益         2045
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C                                                   
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.9.7
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    tools_3.3.1     htmltools_0.3.5 Rcpp_0.12.4    
##  [5] stringi_1.1.1   rmarkdown_1.0   knitr_1.14      stringr_1.1.0  
##  [9] digest_0.6.10   evaluate_0.9

Under CRAN version (1.9.6) of data.table

library(data.table)
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
               stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
##              PL_Type HS_Port_Code
##  1: 公允价值变动损益         2042
##  2: 公允价值变动损益         2013
##  3: 公允价值变动损益         2032
##  4: 公允价值变动损益         2052
##  5: 公允价值变动损益         2035
##  6: 公允价值变动损益         2022
##  7: 公允价值变动损益         2015
##  8: 公允价值变动损益         2025
##  9: 公允价值变动损益         2023
## 10: 公允价值变动损益         2012
## 11: 公允价值变动损益         2055
## 12: 公允价值变动损益         8212
## 13: 公允价值变动损益         8222
## 14: 公允价值变动损益         2045
unique(Encoding(dt$PL_Type))
## [1] "unknown"
dt[, PL_Type := enc2utf8(PL_Type)]
setkey(dt, PL_Type)
unique(Encoding(dt$PL_Type))
## [1] "UTF-8"
dt[J("公允价值变动损益")]
## Warning in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends,
## nomatch, : A known encoding (latin1 or UTF-8) was detected in a join
## column. data.table compares the bytes currently, so doesn't support *mixed*
## encodings well; i.e., using both latin1 and UTF-8, or if any unknown
## encodings are non-ascii and some of those are marked known and others
## not. But if either latin1 or UTF-8 is used exclusively, and all unknown
## encodings are ascii, then the result should be ok. In future we will check
## for you and avoid this warning if everything is ok. The tricky part is
## doing this without impacting performance for ascii-only cases.
##             PL_Type HS_Port_Code
## 1: 公允价值变动损益           NA
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C                                                   
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.9.6
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    tools_3.3.1     htmltools_0.3.5 Rcpp_0.12.6    
##  [5] stringi_1.1.1   rmarkdown_1.0   knitr_1.14      stringr_1.1.0  
##  [9] digest_0.6.10   chron_2.3-47    evaluate_0.9

@shrektan
Member Author

Also, the code produces different output on my Mac. I guess it's because the native encoding is not UTF-8 on Windows.

Result in Mac OS X with data.table 1.9.7

library(data.table)
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
               stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
##              PL_Type HS_Port_Code
##  1: 公允价值变动损益         2042
##  2: 公允价值变动损益         2013
##  3: 公允价值变动损益         2032
##  4: 公允价值变动损益         2052
##  5: 公允价值变动损益         2035
##  6: 公允价值变动损益         2022
##  7: 公允价值变动损益         2015
##  8: 公允价值变动损益         2025
##  9: 公允价值变动损益         2023
## 10: 公允价值变动损益         2012
## 11: 公允价值变动损益         2055
## 12: 公允价值变动损益         8212
## 13: 公允价值变动损益         8222
## 14: 公允价值变动损益         2045
unique(Encoding(dt$PL_Type))
## [1] "unknown"
dt[, PL_Type := enc2utf8(PL_Type)]
setkey(dt, PL_Type)
unique(Encoding(dt$PL_Type))
## [1] "UTF-8"
dt[J("公允价值变动损益")]
##              PL_Type HS_Port_Code
##  1: 公允价值变动损益         2042
##  2: 公允价值变动损益         2013
##  3: 公允价值变动损益         2032
##  4: 公允价值变动损益         2052
##  5: 公允价值变动损益         2035
##  6: 公允价值变动损益         2022
##  7: 公允价值变动损益         2015
##  8: 公允价值变动损益         2025
##  9: 公允价值变动损益         2023
## 10: 公允价值变动损益         2012
## 11: 公允价值变动损益         2055
## 12: 公允价值变动损益         8212
## 13: 公允价值变动损益         8222
## 14: 公允价值变动损益         2045
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12 (Sierra)
## 
## locale:
## [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.9.7
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    tools_3.3.1     htmltools_0.3.5 Rcpp_0.12.6    
##  [5] stringi_1.1.1   rmarkdown_1.0   knitr_1.14      stringr_1.1.0  
##  [9] digest_0.6.10   evaluate_0.9
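
Whether a session falls into the non-UTF-8 case guessed at in this comment can be checked from base R. The snippet below is a small sketch using only `l10n_info()` and `Sys.getlocale()`; the variable name `is_utf8_session` is chosen here for illustration.

```r
# Report whether the current R session uses a UTF-8 native encoding.
info <- l10n_info()
is_utf8_session <- isTRUE(info[["UTF-8"]])

# The character-type locale, e.g. a code page 936 locale on Chinese Windows
# or zh_CN.UTF-8 on macOS, as in the sessionInfo() outputs in this thread.
ctype <- Sys.getlocale("LC_CTYPE")

cat("UTF-8 session:", is_utf8_session, "\n")
cat("LC_CTYPE:", ctype, "\n")
```

On a machine where `is_utf8_session` is FALSE, strings with "unknown" encoding hold native-encoding bytes, which is the situation the Windows results above come from.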

@shrektan
Member Author

Sorry, but can anyone reproduce this?

@shrektan
Member Author

shrektan commented Sep 9, 2016

@arunsrinivasan Sorry, I know you're very busy. However, I personally think it's an important issue for data.table users who use non-ASCII characters on Windows, since all of them will hit the same issue...

I don't have the expertise to fix the problem... So, when you're available, please take a look at this...

Thanks so much.

@jangorecki
Member

jangorecki commented Sep 9, 2016

On Ubuntu with R 3.3.1 it works as on Mac, so this is probably something Windows-related.

library(data.table)
#data.table 1.9.7 IN DEVELOPMENT built 2016-09-01 21:24:37 UTC; travis
dt <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
               stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dt)
setkey(dt, PL_Type)
dt[J("公允价值变动损益")]
#             PL_Type HS_Port_Code
# 1: 公允价值变动损益         2042
# 2: 公允价值变动损益         2013
# 3: 公允价值变动损益         2032
# 4: 公允价值变动损益         2052
# 5: 公允价值变动损益         2035
# 6: 公允价值变动损益         2022
# 7: 公允价值变动损益         2015
# 8: 公允价值变动损益         2025
# 9: 公允价值变动损益         2023
#10: 公允价值变动损益         2012
#11: 公允价值变动损益         2055
#12: 公允价值变动损益         8212
#13: 公允价值变动损益         8222
#14: 公允价值变动损益         2045

@MichaelChirico
Member

Yes, please be sure to report your system specs if you can't get it to work on development.


@shrektan
Member Author

shrektan commented Sep 20, 2016

@arunsrinivasan @jangorecki @MichaelChirico I have put my system session info above, in #1826 (comment).

I tried installing different commits of data.table to see which one causes this bug, and found that the commit is 03cd45f

Please help me when you have the time (and please remove the "not reproducible" label, since I can reproduce it on my colleague's computer... I'm pretty sure it can be reproduced on every Windows machine as long as the default encoding is not UTF-8). I really appreciate it! Thanks again.

Here's the session info again:

d> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.7

loaded via a namespace (and not attached):
[1] rsconnect_0.4.3 tools_3.3.1     withr_1.0.2     memoise_1.0.0  
[5] digest_0.6.10   devtools_1.12.0

d> devtools::session_info()
Session info -------------------------------------------------------------
 setting  value                                              
 version  R version 3.3.1 (2016-06-21)                       
 system   i386, mingw32                                      
 ui       RStudio (0.99.902)                                 
 language (EN)                                               
 collate  Chinese (Simplified)_People's Republic of China.936
 tz       Asia/Taipei                                        
 date     2016-09-20                                         

Packages -----------------------------------------------------------------
 package    * version date       source        
 data.table * 1.9.7   2016-09-09 local         
 devtools     1.12.0  2016-06-24 CRAN (R 3.3.1)
 digest       0.6.10  2016-08-02 CRAN (R 3.3.1)
 memoise      1.0.0   2016-01-29 CRAN (R 3.2.3)
 rsconnect    0.4.3   2016-05-02 CRAN (R 3.2.5)
 withr        1.0.2   2016-06-20 CRAN (R 3.2.5)

@shrektan
Member Author

BTW, the R version might matter, since the function base::match was modified in R 3.3.0 and a bug was fixed in R 3.3.1 (see https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16885)

@shrektan
Member Author

Well, can anybody take a look at this? Or should I open a new issue?

@arunsrinivasan arunsrinivasan added this to the v2.0.0 milestone Sep 25, 2016
@jangorecki
Member

I doubt opening another issue will help, at least as long as this one is still open. If you are in a rush, you can always use a fork until it is resolved in master. This is common practice, not just in R but in open source projects generally, and there is nothing wrong with it. Many companies modify open source projects to better fit their needs.

@shrektan
Member Author

@jangorecki thanks for the advice. 👍

@shrektan
Member Author

shrektan commented Mar 14, 2017

@arunsrinivasan @jangorecki First of all, thanks for your attention to this issue. After some experiments, I think I have located the root cause of this issue.

It's because the strings can sort in a different order under different encodings.

See the example below:

library(data.table)
dtRaw <- read.csv("https://raw.githubusercontent.com/shrektan/temporary/master/data.csv",
               stringsAsFactors = FALSE, fileEncoding = "GB2312")
setDT(dtRaw)
dt <- data.table(CN = unique(dtRaw$PL_Type), VALUE = seq_len(uniqueN(dtRaw$PL_Type)))
setkey(dt, CN)

dt2 <- data.table(CN = enc2utf8(unique(dtRaw$PL_Type)), VALUE = seq_len(uniqueN(dtRaw$PL_Type)))
setkey(dt2, CN)

print(dt)
##                   CN VALUE
##  1: 公允价值变动损益     1
##  2:         红利收入     2
##  3:         汇兑损益    10
##  4:         价差收入     3
##  5:         交易费用     9
##  6:         利息支出     5
##  7:         利息收入     4
##  8:     其他业务支出     7
##  9:   营业税金及附加     6
## 10:     资产减值损失     8
print(dt2)
##                   CN VALUE
##  1:         交易费用     9
##  2:         价差收入     3
##  3: 公允价值变动损益     1
##  4:     其他业务支出     7
##  5:         利息支出     5
##  6:         利息收入     4
##  7:         汇兑损益    10
##  8:         红利收入     2
##  9:   营业税金及附加     6
## 10:     资产减值损失     8
dt[J("公允价值变动损益")]
##                  CN VALUE
## 1: 公允价值变动损益    NA
dt2[J("公允价值变动损益")]
##                  CN VALUE
## 1: 公允价值变动损益     1

The sessionInfo is below. I strongly believe this only occurs on the Windows platform:

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936  LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936 LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] tools_3.3.2          withr_1.0.2          memoise_1.0.0        digest_0.6.12        devtools_1.12.0.9000

What data.table currently does is compare strings in UTF-8 encoding after the key has been set (in the original encoding) and then use binary search, so...
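
The mismatch described above can be reproduced in base R without data.table. The following is a minimal sketch, not data.table's actual implementation: it takes the first characters of five keys from the tables printed earlier, orders them by their GBK bytes (used here as a stand-in for the code page 936 native encoding, which is an assumption), and then runs a plain binary search that compares UTF-8 bytes. The helpers `byte_key` and `bsearch` are hypothetical names, and the snippet assumes the platform's iconv supports "GBK".

```r
# First characters of five keys from the thread, written as escapes so the
# script does not depend on the encoding of the source file.
keys <- c("\u516c", "\u7ea2", "\u6c47", "\u4ef7", "\u4ea4")

# Numeric value of a string's bytes in a given encoding. Values are only
# comparable between strings of equal byte length, which holds here.
byte_key <- function(s, to) {
  vapply(s, function(e) {
    b <- as.integer(charToRaw(iconv(e, from = "UTF-8", to = to)))
    sum(b * 256^rev(seq_along(b) - 1))
  }, numeric(1), USE.NAMES = FALSE)
}

gbk  <- byte_key(keys, "GBK")
utf8 <- byte_key(keys, "UTF-8")

# UTF-8 byte values arranged in GBK order, i.e. "keyed" in the native encoding.
utf8_in_gbk_order <- utf8[order(gbk)]

# Plain binary search; it is only correct if `table` is sorted ascending.
bsearch <- function(target, table) {
  lo <- 1L
  hi <- length(table)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L
    if (table[mid] == target) return(mid)
    if (table[mid] < target) lo <- mid + 1L else hi <- mid - 1L
  }
  NA_integer_
}

target <- byte_key("\u4ea4", "UTF-8")
miss <- bsearch(target, utf8_in_gbk_order)  # NA: data ordered by GBK bytes
hit  <- bsearch(target, sort(utf8))         # 1: data ordered by UTF-8 bytes
```

The two encodings order these characters differently, so a search that assumes UTF-8 order over GBK-ordered data can walk away from a row that is present.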

So I guess the fix should be: when setkey() is called on a data.table, order the strings by their UTF-8 encoding instead of their raw encoding.
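
Until something like that is done inside data.table, a user-side workaround consistent with the results earlier in this thread is to convert the key column to UTF-8 before calling setkey(), and to convert the lookup value too. This is a sketch with made-up data (three keys from the tables above, written as escapes), not the fix proposed for the package itself.

```r
library(data.table)

# Three keys from the thread, written as escapes so the script does not
# depend on the encoding of the source file.
dt <- data.table(
  CN    = c("\u516c\u5141", "\u7ea2\u5229", "\u4ea4\u6613"),
  VALUE = 1:3
)

# Normalise the key column to UTF-8 first, then key the table, so the
# physical order and the join comparison use the same encoding.
dt[, CN := enc2utf8(CN)]
setkey(dt, CN)

# Normalise the lookup value the same way before joining.
res <- dt[J(enc2utf8("\u4ea4\u6613"))]
res$VALUE
```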


UPDATED: Actually, after running the script above, if you call setkey(dt, CN) again, you will get a warning:

Warning message:
In setkeyv(x, cols, verbose = verbose, physical = physical) :
  Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed.

And I thought it should have been fixed by 409d709

But it wasn't... I have no clue now... 😭

Thanks.

@shrektan
Member Author

shrektan commented May 2, 2017

I know that you guys are very busy. However, can anyone take a look? Maybe you can give me some hints so that I can help solve this issue? I would really appreciate it. Thanks.

@shrektan
Member Author

shrektan commented Nov 3, 2017

Closing because the conversation here has become quite confusing.

@shrektan shrektan closed this as completed Nov 3, 2017
@mattdowle mattdowle modified the milestones: Candidate, v1.11.0 May 10, 2018
@shrektan shrektan added the encoding issues related to Encoding label Sep 9, 2019