[Request] Ability to use OR statements within fread #2185
Comments
Possible duplicate/extension of #2066
Well spotted! #2066 is essentially the same idea. I ended up working around the problem in a similar way, by 'manually' rewriting the headers. This works, but it's time-consuming (kind of the opposite of data.table itself) and it may not always be applicable; e.g. for legal records, rewriting the field names could be construed as tampering with an original data set.

Another approach is to explicitly state each header and load everything outside of a loop. Not too bad if it's only one or two, or ten. Awful when it's hundreds, thousands, or hundreds of thousands. Being able to state two or more field names per field to import would be a very quick solution.

With regard to this, I would also suggest adding a built-in field name survey argument to fread. E.g. if I have a directory with ten thousand files in it, I could simply point to the directory and have it pull all the headers and drop them into a data.table with the file name next to each. An additional option could be to show only those that differ in some way. This can already be done with fread, by looping through the files, reading the first nrows of each and binding the results, but it'd be even easier if there was some form of survey argument within fread that automated the process.

I've done the above by writing some code myself, then exported the results (for a smaller directory) to Excel to produce a pictogram showing the resulting structure (example pictogram), where each 'row' is the next file, file names are down the left, and each colour is a different field name, with the field names in the key on the right. This allows for a visual examination of what is going on within a directory; for example, the fact that many of the coloured vertical lines don't line up indicates that structural changes have occurred (as their shading is specific to a field name). Maybe there's some way to tie the output of a fread field name sweep directly into an R plot to produce something similar.

Edit: The field names also change in the pictogram example; e.g. 'REPT_DT' goes to 'rept_dt' and then ' rept_dt' (with a space before it), and 'GNDR_COD' goes to 'gndr_cod' and then 'sex'.
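For anyone wanting the header survey described in this comment today, here is a minimal sketch of the loop-and-bind approach. It is not the commenter's actual code: `survey_headers()` and `differing_fields()` are hypothetical helper names, and the `.csv` file pattern is an assumption. It relies only on `fread(nrows = 0)`, which returns the column names without reading any data rows.

```r
library(data.table)

# Hypothetical helper (not part of data.table): collect the header of every
# file in a directory as one long table of file / position / field.
survey_headers <- function(dir, pattern = "\\.csv$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  rbindlist(lapply(files, function(f) {
    # nrows = 0 reads the column names only, no data rows
    hdr <- names(fread(f, nrows = 0))
    data.table(file = basename(f), position = seq_along(hdr), field = hdr)
  }))
}

# Hypothetical helper: keep only the (position, field) pairs that are NOT
# shared by every surveyed file, i.e. the places where the structure differs.
differing_fields <- function(hdrs) {
  n_files <- uniqueN(hdrs$file)
  counts <- hdrs[, .(n = .N), by = .(position, field)]
  hdrs[counts[n < n_files], on = .(position, field)]
}
```

Calling `differing_fields(survey_headers("path/to/dir"))` (with a placeholder path) would then list only the files and fields whose name or position is not common to the whole directory. The long format should also be straightforward to feed into a plotting function in place of the Excel step.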
It would be handy if there was some way to use OR statements within fread.
Some of the information I work with has both variable column positions and variable field names, but only one form of a given field name is ever present in a file at any one time.
As far as I'm aware, this isn't available with stock data.frames, but perhaps there's some way to add the functionality to data.table. I think it would require a consistent name to be assigned first within the function; e.g.
select = c('a | a1 | a2', 'b | b1 | b2'),
col.names = c('a | a1 | a2' = 'a', 'b | b1 | b2' = 'b'),
colClasses = c(a = 'class', b = 'class'),
To avoid collisions it would only need to warn if it encounters more than one OR variable within a file.
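Until something like this exists in fread itself, the request can be approximated with a small wrapper. The following is only a sketch under assumptions: `fread_aliases()` is a hypothetical helper, not a data.table function, and the alias list stands in for the proposed `'a | a1 | a2'` syntax. It reads the header first, picks whichever alias is present, warns if more than one is found, and renames to the consistent name.

```r
library(data.table)

# Hypothetical helper (not part of data.table). `aliases` is a named list:
# each name is the consistent output name, each element the candidate
# field names that may appear in the file, in order of preference.
fread_aliases <- function(file, aliases) {
  hdr <- names(fread(file, nrows = 0))  # header only, no data rows
  found <- lapply(aliases, function(a) intersect(a, hdr))
  n <- lengths(found)
  if (any(n == 0L)) {
    stop("no alias found for: ", paste(names(aliases)[n == 0L], collapse = ", "))
  }
  if (any(n > 1L)) {
    # the collision case from the request: warn and use the first match
    warning("more than one alias found for: ",
            paste(names(aliases)[n > 1L], collapse = ", "))
  }
  old <- unname(vapply(found, `[`, character(1), 1L))
  dt <- fread(file, select = old)
  setnames(dt, old = old, new = names(aliases))
  setcolorder(dt, names(aliases))  # same column order regardless of file layout
  dt[]
}
```

With `aliases <- list(a = c("a", "a1", "a2"), b = c("b", "b1", "b2"))`, calling `fread_aliases(file, aliases)` on each file in a loop should give tables with identical names and column order, ready for `rbindlist()`. The cost is that each file's header is read twice.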