Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parameter to fread to read numbers with leading zeros as character #2999

Closed
marc-outins opened this issue Aug 13, 2018 · 22 comments
Closed

Comments

@marc-outins
Copy link

Add a parameter or option to not ignore leading zeros when reading data with fread. I have data that contains numeric looking data that has leading zeros that I would like fread to read as character. I've seen other people asking about this on stack overflow and the some of the main responses are to set colClasses = "character" so all columns are read as character or specifically call out the columns that need to be read as character. These options aren't great if there are lots of columns with this issue along with other columns that should be read in as non-character. I've never dealt with data that looks like "0300" that really represents 300 so I would by default like to read columns that contain data with leading 0's as character.

# Minimal Reproducible Example

library(data.table)

dt <- fread(input = "0300\n")
str(dt)

Classes ‘data.table’ and 'data.frame':	1 obs. of  1 variable:
 $ V1: int 300
 - attr(*, ".internal.selfref")=<externalptr>

# Output of sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] doSNOW_1.0.16     snow_0.4-2        iterators_1.0.10  foreach_1.4.4     purrr_0.2.5       stringi_1.2.4    
 [7] stringr_1.3.1     fuzzyjoin_0.1.4   data.table_1.11.4 rvest_0.3.2       xml2_1.2.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18     compiler_3.5.1   pillar_1.3.0     bindr_0.1.1      tools_3.5.1      digest_0.6.15   
 [7] evaluate_0.11    tibble_1.4.2     pkgconfig_2.0.1  rlang_0.2.1      rstudioapi_0.7   curl_3.2        
[13] yaml_2.1.19      bindrcpp_0.2.2   dplyr_0.7.6      httr_1.3.1       knitr_1.20       hms_0.4.2       
[19] rprojroot_1.3-2  tidyselect_0.2.4 glue_1.3.0       R6_2.2.2         rmarkdown_1.10   readr_1.1.1     
[25] selectr_0.4-1    magrittr_1.5     backports_1.1.2  codetools_0.2-15 htmltools_0.3.6  assertthat_0.2.0
[31] crayon_1.3.4    
@gsgxnet
Copy link

gsgxnet commented Sep 13, 2018

i can second that request from my experience with CSV files. Quite often numeric ids/keys in a source database are exported with leading zeros. Because thats their data dictionary definition in the database. E.g. a customer number is 123456 and it is defined as 10 digits in the db. In the CSV export this would be done as 0000123456. Most times there is no use at all for a conversion into a numeric column of a customer number when imported.

Only culprit i can see is a column containing just a 0. Should be treated as numeric? Or is this a leading 0 for ?

@jangorecki
Copy link
Member

are you able to insert a row having id 9999999999+1 and see what csv extract will look like? characters are safer to store ID as you won't hit int32 limit. They are safer in terms of in/out data from db-other tools, inside system/db integers should be preferred.

@st-pasha
Copy link
Contributor

If there is an int that is too large to fit into int64, the entire column will be converted into character type (and in that case leading zeros are preserved):

> fread("A\n001\n12345678901234567890")
                      A
                 <char>
1:                  001
2: 12345678901234567890

The only time when the problem arises is when all IDs are sufficiently small, and have leading 0s.

@marc-outins
Copy link
Author

The issue is not only related to id's. We deal with data that includes NDC codes which contain leading zeros. One of the data sets we get from the National Cancer Institute has a table that contains over 2000 variables. Some of those variables contain coded data with leading 0's (01, 02, etc) so manually adding colClasses to all them would be difficult

@st-pasha For all the data I've encountered I don't think I've run into ids bigger than int64 allows so for all of my cases I'll never have fread pick the right column type for my data with leading zero's.

It just seems like if fread is going to guess column types it shouldn't remove data (leading zero's) especially without warning the user that it did that. I'm curious if any of you have run into data with leading zero's that you would want fread to read in as a numeric?

@st-pasha
Copy link
Contributor

We thought about these issues, and even contemplated adding an option, but in the end, it fell off from our radars somehow.
The use case in favor of retaining 0s are zip-codes and user IDs. The use case in favor of removing zeros is line numbers in text files (001, 002, etc), and numbers formatted with printf("%0Nd"). I believe the "remove zeros" viewpoint won because a) if it looks like a number then we should treat it like a number; and b) the integers are faster to parse and also take less space in memory -- so for users who don't care about the representation, the integer choice is more preferable.

I've added a [beginner-task] label because this is a task that is actually quite easy to do: look into fread.c for functions StrtoI32 and StrtoI64. There is even a comment there about leading 0s, but it seems to be outdated.

@jangorecki
Copy link
Member

IMO this should not be default behavior. Or at least it should be easy to keep leading zero with colClasses=c(id="character"). Whoever will implement that please include test for that.

@gsgxnet
Copy link

gsgxnet commented Sep 15, 2018

I do understand many pros and cons. Right an int64 is principally okay as format for a numeric code/id/ when read, and probably it needs less storage most times.
If it just were this one csv. But in real life we have to cope with many sources which handle codes/ids different. E.g. if you read some data by jdbc they will probably be transferred as character, other data from xls it depends, may be number, may be character. If these codes/ids were needed for joins between the different data sources the format has to be unified. No fun.

csv is is a quite poor data format. Much metadata is lost when db tables are exported into it. Leading 0 might be a hint for the right reconstruction of missing metadata.

@marc-outins - I can agree in the last years of all columns with leading 0 I had seen, none were meant to be numeric.

@MichaelChirico
Copy link
Member

@gsgxnet your last point I think is the best take, and the best motivation for this feature/argument/improvement.

Regarding the inconsistency across many sources, one is always free to store my_colclasses = list(character = 'my_id_col') and pass that to any number of input CSVs.

And I think more important is better upstream data management (wherever possible).

@marc-outins
Copy link
Author

I'm working on implementing this feature. I've come around to the idea of keeping the current default behavior of dropping the leading zeros and reading as numeric since the base R readers read.table, read.csv, etc have this behavior.

I've implemented a solution that allows the user to set an option data.table.fread.keepLeadingZeros to be TRUE (default will be FALSE) if they want fread to read data with leading zeros as character. (see marc-outins@60d6653). Still needs more work but this is the general idea. I also still need to add tests and would like to add a warning to let the user know something happened when data.table.fread.keepLeadingZeros==FALSE and a column with leading zero's is stored as numeric.

I'm curious if people prefer this solution or adding a specific parameter to fread? There is the logical01 parameter which is in the same vain as the leading zero option so it may make sense.

@gsgxnet
Copy link

gsgxnet commented Sep 20, 2018

@marc-outins many thanks for your good effort. Yes, please keep the current default behavior, otherwise existing code might be broken.
I would love to have a new parameter to fread because that way it could be used very flexible with every fread statement within the coding of one script/notebook/... containing several freads. But using an option should work there too, may be with some extra effort.
There are many parameters in fread() already, may be one more might be regarded as overload. If not my vote is for option and a parameter with default = option.

@johncassil
Copy link

This just caused an issue for me when trying to read in a file including zip codes and then broke the join with data from a database. Cutting off leading zeroes is one of the reasons I don't use Excel. I typically expect R to not have fickle behavior such as changing my data by making assumptions about it, so I'm lending my voice to the "this should be default behavior" position. For now, I switched to read_csv() for this process.

@MichaelChirico
Copy link
Member

@johncassil use colClasses to force that column to be read as a character for now please :)

@jangorecki
Copy link
Member

As we can use colClasses mentioned my Michael, this issue is now more about default behaviour. I am putting my vote for dropping leading zeros by default. Thus no changes are required, we are more consistent to base R, and IMO this is a proper way, if someone wants leading zeros later on it can be easily achieved by "sprintf" function, or just "colClasses" at start.

@jangorecki jangorecki added this to the 1.12.2 milestone Jan 16, 2019
@MichaelChirico
Copy link
Member

I think it's more complicated... if file is like

id,zip_code
1,"01234"
2,"19103"
3,"64890"
4,"01023"

then I think it's clear character was intended and leading 0 shouldn't be dropped.

Unquoted case is a tougher call:

id,zip_code
1,01234
2,19103
3,64890
4,01023

But probably good default to drop leading 0 here. Upstream data producer can be told to add quotes if it's possible, otherwise colClasses approach works to declare intentionality.

Problem with sprintf approach is there could be unintentional data loss:

id,my_code
1,00001234
2,00019103
3,00064890
4,00001023

In this case, shared triplet of leading zeros will be dropped and user might not know how many 0s are needed to pad. A tad problematic if reading a folder of files which may truncate differently... overselling a bit since I imagine in most cases where this actually matters, user should know the intended code length. But anyway using sprintf and max(nchar()) approach is a bit awkward.

@ejoranlienea
Copy link

While dropping leading zeros by default seems like the right (and more consistent) solution, not having the option to keep them is a source of pain for my team.

The data we use every day is filled with numeric identifiers with leading zeros. As it stands we have three options: specify the class of every variable in every file, load everything as character and coerce individual columns afterwards, or string pad all ID variables with zeros. All three of these are all over our codebase, but they all require extra code and are easy to forget.

If we could instead set a single keepLeadingZeros parameter, it would save us time and debugging. Without the option, we can't take advantage of fread's type detection, and we generally train new employees to always set colClasses = 'character'.

@MichaelChirico
Copy link
Member

@ejoranlienea thanks, this is valuable input. Do you have any sample files you could share? Out-in-the-wild examples are always appreciated for understanding end use cases 👍

@ejoranlienea
Copy link

@MichaelChirico Sure, here is some fake data roughly representative of a common type of file we get. It's a caret-delimited extract with no quoting and a variety of data types. We have no control over the source system, so we can't request any format changes that might be helpful. In this data, district_code, school_code, and student_code all have values with leading zeros, and the remaining columns are a mix of character, integer, and date.

https://pastebin.com/FKGerT9a

I generated 1000 rows, but the first 20 or so should be sufficient to show the issue.

To load this data, we would generally either load with colClasses = 'character' and coerce the ints and dates, or let data.table determine types, coerce the IDs, and prepend zeros where necessary. In real data there would be many more columns, and the leading-zero fields would have inconsistent names and locations.

@gsgxnet
Copy link

gsgxnet commented Jan 17, 2019

I just updated to the newest 1.12.0 version of data.table from CRAN.
There seems to be still no

data.table.fread.keepLeadingZeros to be TRUE (default will be FALSE)

option in that version. Looks like is is only available in your fork, @marc-outins.

Are there plans to have it in the official release? It would be a big step forward because having to set sometimes many many columns by hand to the right format just to get those with leading 0 parsed right is much effort which could be avoided easily.

Code used to test for the availability of the option:
library(data.table)
options("data.table.fread.keepLeadingZeros"=TRUE)
t2 <- fread(text = "NAME,AGE,ZIP,KEY\nPAUL,11,43216,00011\nMARY,73,01234,99989\nPETER,92,62841,32834")
str(t2)

Classes ‘data.table’ and 'data.frame':	3 obs. of  4 variables:
 $ NAME: chr  "PAUL" "MARY" "PETER"
 $ AGE : int  11 73 92
 $ ZIP : int  43216 1234 62841
 $ KEY : int  11 99989 32834```

@marc-outins
Copy link
Author

@gsgxnet sorry I got super busy with work and never finished the steps to contribute it to the main build. It's been a little while but I believe I have a working solution, I just need to add tests. I'll look into finishing this up in the next week and try to at least get my github fork updated with the changes.

@marc-outins
Copy link
Author

marc-outins commented Jan 18, 2019

@gsgxnet I updated my fork on github to catch up to the latest version of data.table (1.12.1) and pushed the branch fread_keepLeadingZeros which passed all of data.tables tests except for a warning and a note (see below):

Duration: 4m 41.1s

> checking compiled code ... NOTE
  File 'data.table/libs/x64/datatable.dll':
    Found no calls to: 'R_registerRoutines', 'R_useDynamicSymbols'
  
  It is good practice to register native routines and to disable symbol
  search.
  
  See 'Writing portable packages' in the 'Writing R Extensions' manual.
   WARNING
  'qpdf' is needed for checks on size reduction of PDFs

0 errors √ | 1 warning x | 1 note x

I'll try to write some tests and QC it some more in the next week.

Thanks

-Marc

@gsgxnet
Copy link

gsgxnet commented Jan 18, 2019

@marc-outins - please excuse me, I did never intend to push you. I appreciate very much your work for a public domain / open source package. The community depends on people like you and sorry I am not good enough in C or C++ development to contribute too.

Many thanks for your effort. Lets hope the pull request will be accepted. Does not look so in the moment. What do you think?

@marc-outins
Copy link
Author

@gsgxnet No need to apologize, I had already done most of the work late last year and definitely needed to push to finish it so thank you. To be honest this is my first pull request for data.table, I tried to follow the contribution guide as best I could. At the very least it passed all the ci tests so that's a good sign.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants