Add parameter to fread to read numbers with leading zeros as character #2999
Comments
I can second that request from my experience with CSV files. Quite often, numeric IDs/keys in a source database are exported with leading zeros, because that is how their data dictionary defines them. E.g. a customer number is 123456 but is defined as 10 digits in the database, so the CSV export writes it as 0000123456. Most of the time there is no use at all in converting an imported customer number into a numeric column. The only edge case I can see is a column containing just a 0. Should that be treated as numeric, or is it a leading 0?
Are you able to insert a row having id 9999999999+1 and see what the CSV extract will look like? Character is the safer type for storing IDs, as you won't hit the int32 limit. Characters are safer in terms of moving data in and out of the db to other tools; inside the system/db, integers should be preferred.
If there is an int that is too large to fit into int64, the entire column will be converted into character type (and in that case leading zeros are preserved):
The only time the problem arises is when all IDs are sufficiently small and have leading 0s.
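To make the type-bumping behavior described above concrete, here is a small sketch using fread (illustrative only; the exact type chosen for the oversized column may depend on the data.table version and the integer64 argument):

```r
library(data.table)

# Small IDs with leading zeros: fread guesses an integer column,
# so "0001" silently becomes 1
small <- fread("id,value\n0001,a\n0042,b")

# An ID too large for int64: per the comment above, the whole column
# is bumped to character, and the leading zeros of "0001" survive
big <- fread("id,value\n0001,a\n99999999999999999999,b")
```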
The issue is not only related to IDs. We deal with data that includes NDC codes, which contain leading zeros. One of the data sets we get from the National Cancer Institute has a table that contains over 2000 variables. Some of those variables contain coded data with leading 0s (01, 02, etc.), so manually adding colClasses for all of them would be difficult. @st-pasha For all the data I've encountered, I don't think I've run into IDs bigger than int64 allows, so in my cases fread will never pick the right column type for data with leading zeros. It just seems that if fread is going to guess column types, it shouldn't remove data (leading zeros), especially without warning the user that it did so. I'm curious whether any of you have run into data with leading zeros that you would want fread to read in as numeric?
We thought about these issues, and even contemplated adding an option, but in the end it fell off our radar somehow. I've added a [beginner-task] label because this is a task that is actually quite easy to do: look into fread.c for functions
IMO this should not be the default behavior. Or at least it should be easy to keep leading zeros with
I do understand many pros and cons.
@marc-outins - I can agree: of all the columns with leading 0s I have seen in recent years, none were meant to be numeric.
@gsgxnet your last point is, I think, the best take, and the best motivation for this feature/argument/improvement. Regarding the inconsistency across many sources, one is always free to store. And more important, I think, is better upstream data management (wherever possible).
I'm working on implementing this feature. I've come around to the idea of keeping the current default behavior of dropping the leading zeros and reading as numeric, since the base R readers read.table, read.csv, etc. behave this way. I've implemented a solution that allows the user to set an option data.table.fread.keepLeadingZeros to TRUE (the default will be FALSE) if they want fread to read data with leading zeros as character (see marc-outins@60d6653). It still needs more work, but this is the general idea. I also still need to add tests, and I would like to add a warning to let the user know something happened when data.table.fread.keepLeadingZeros == FALSE and a column with leading zeros is stored as numeric. I'm curious whether people prefer this solution or adding a specific parameter to fread? There is the logical01 parameter, which is in the same vein as the leading-zero option, so that may make sense.
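Under this proposal, usage would look roughly as follows (a sketch using the option name from the linked commit, which was not part of any released data.table at the time):

```r
library(data.table)

csv <- "zip,value\n00501,10\n02134,20"

# Current default: "00501" is read as the integer 501
fread(csv)

# Proposed: opt in to keeping zero-padded fields as character
options(data.table.fread.keepLeadingZeros = TRUE)
fread(csv)

# Restore the default so other code is unaffected
options(data.table.fread.keepLeadingZeros = FALSE)
```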
@marc-outins many thanks for your good effort. Yes, please keep the current default behavior; otherwise existing code might break.
This just caused an issue for me when trying to read in a file including zip codes, and it then broke the join with data from a database. Cutting off leading zeroes is one of the reasons I don't use Excel. I typically expect R not to have fickle behavior such as changing my data by making assumptions about it, so I'm lending my voice to the "this should be default behavior" position. For now, I switched to read_csv() for this process.
@johncassil use
As we can use
I think it's more complicated... if the file is like
then I think it's clear that character was intended and the leading 0 shouldn't be dropped. The unquoted case is a tougher call:
But dropping the leading 0 is probably a good default here. The upstream data producer can be told to add quotes where possible. The problem with
In this case, a shared triplet of leading zeros will be dropped and the user might not know how many 0s are needed to pad. A tad problematic if reading a folder of files which may truncate differently... though I'm overselling a bit, since I imagine in most cases where this actually matters, the user should know the intended code length. But anyway, using
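The two workarounds implied above can be spelled out (a sketch; the column name and the padded width of 4 are assumptions):

```r
library(data.table)

csv <- "code,value\n0042,a\n0007,b"

# Workaround 1: force the affected column to character up front
dt1 <- fread(csv, colClasses = list(character = "code"))

# Workaround 2: read as integer and pad back afterwards -- only safe
# when the intended code width is known (here assumed to be 4)
dt2 <- fread(csv)
dt2[, code := formatC(code, width = 4, flag = "0")]
```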
While dropping leading zeros by default seems like the right (and more consistent) solution, not having the option to keep them is a source of pain for my team. The data we use every day is filled with numeric identifiers with leading zeros. As it stands, we have three options: specify the class of every variable in every file, load everything as character and coerce individual columns afterwards, or string-pad all ID variables with zeros. All three of these appear all over our codebase, but they all require extra code and are easy to forget. If we could instead set a single
@ejoranlienea thanks, this is valuable input. Do you have any sample files you could share? Out-in-the-wild examples are always appreciated for understanding end use cases 👍
@MichaelChirico Sure, here is some fake data roughly representative of a common type of file we get. It's a caret-delimited extract with no quoting and a variety of data types. We have no control over the source system, so we can't request any format changes that might be helpful. I generated 1000 rows, but the first 20 or so should be sufficient to show the issue. To load this data, we would generally either load with
I just updated to the newest data.table, but the option does not seem to be available in that version. Looks like it is only available in your fork, @marc-outins. Are there plans to have it in the official release? It would be a big step forward, because one currently has to set many, many columns by hand to the right format just to get those with leading 0s read correctly. Code used to test for the availability of the option:
|
@gsgxnet sorry, I got super busy with work and never finished the steps to contribute it to the main build. It's been a little while, but I believe I have a working solution; I just need to add tests. I'll look into finishing this up in the next week and try to at least get my GitHub fork updated with the changes.
@gsgxnet I updated my fork on GitHub to catch up to the latest version of data.table (1.12.1) and pushed the branch fread_keepLeadingZeros, which passed all of data.table's tests except for a warning and a note (see below):
I'll try to write some tests and QC it some more in the next week. Thanks -Marc
@marc-outins - please excuse me, I never intended to push you. I very much appreciate your work on a public domain / open source package. The community depends on people like you, and I'm sorry I am not good enough at C or C++ development to contribute too. Many thanks for your effort. Let's hope the pull request will be accepted. It does not look like it at the moment. What do you think?
@gsgxnet No need to apologize, I had already done most of the work late last year and definitely needed the push to finish it, so thank you. To be honest, this is my first pull request for data.table; I tried to follow the contribution guide as best I could. At the very least it passed all the CI tests, so that's a good sign.
Add a parameter or option to not ignore leading zeros when reading data with fread. I have data containing numeric-looking values with leading zeros that I would like fread to read as character. I've seen other people asking about this on Stack Overflow, and some of the main responses are to set colClasses = "character" so all columns are read as character, or to specifically call out the columns that need to be read as character. These options aren't great if there are lots of columns with this issue along with other columns that should be read in as non-character. I've never dealt with data that looks like "0300" but really represents 300, so by default I would like columns containing data with leading 0s to be read as character.
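The blanket workaround mentioned here looks like the following (a sketch; "data.csv" and the column name are placeholders):

```r
library(data.table)

# Read every column as character so nothing is mangled...
dt <- fread("data.csv", colClasses = "character")

# ...then convert back the columns that are genuinely numeric
dt[, value := as.numeric(value)]   # 'value' is a hypothetical column
```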