Skip to content

Commit

Permalink
fread drop= last column when last field is quoted and contains sep too (
Browse files Browse the repository at this point in the history
  • Loading branch information
mattdowle authored Nov 8, 2017
1 parent 7db217b commit 1b8e886
Show file tree
Hide file tree
Showing 3 changed files with 36 additions and 7 deletions.
2 changes: 1 addition & 1 deletion NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@
* Now handles floating-point NaN values in a wide variety of formats, including `NaN`, `sNaN`, `1.#QNAN`, `NaN1234`, `#NUM!` and others, [#1800](https://github.com/Rdatatable/data.table/issues/1800). Thanks to Jori Liesenborgs for highlighting and the PR.
* If negative numbers are passed to `select=` the out-of-range error now suggests `drop=` instead, [#2423](https://github.com/Rdatatable/data.table/issues/2423). Thanks to Michael Chirico for the suggestion.
* `sep=NULL` or `sep=""` (i.e., no column separator) can now be used to specify single column input reliably like `base::readLines`, [#1616](https://github.com/Rdatatable/data.table/issues/1616). `sep='\\n'` still works (even on Windows where line ending is actually `\\r\\n`) but `NULL` or `""` are now documented and recommended. Thanks to Dmitriy Selivanov for the pull request and many others for comments. As before, `sep=NA` is not valid; use the default `"auto"` for automatic separator detection. `sep='\\n'` may be deprecated in future.
* Many thanks to @yaakovfeldman, Guillermo Ponce, Arun Srinivasan, Hugh Parsonage, Mark Klik, Pasha Stetsenko, Mahyar K, Tom Crockett, @cnoelke, @qinjs, @etienne-s, Mark Danese, Avraham Adler, @franknarf1, @MichaelChirico for testing before release to CRAN: [#2070](https://github.com/Rdatatable/data.table/issues/2070), [#2073](https://github.com/Rdatatable/data.table/issues/2073), [#2087](https://github.com/Rdatatable/data.table/issues/2087), [#2091](https://github.com/Rdatatable/data.table/issues/2091), [#2107](https://github.com/Rdatatable/data.table/issues/2107), [fst#50](https://github.com/fstpackage/fst/issues/50#issuecomment-294287846), [#2118](https://github.com/Rdatatable/data.table/issues/2118), [#2092](https://github.com/Rdatatable/data.table/issues/2092), [#1888](https://github.com/Rdatatable/data.table/issues/1888), [#2123](https://github.com/Rdatatable/data.table/issues/2123), [#2167](https://github.com/Rdatatable/data.table/issues/2167), [#2194](https://github.com/Rdatatable/data.table/issues/2194), [#2238](https://github.com/Rdatatable/data.table/issues/2238), [#2228](https://github.com/Rdatatable/data.table/issues/2228), [#1464](https://github.com/Rdatatable/data.table/issues/1464), [#2201](https://github.com/Rdatatable/data.table/issues/2201), [#2287](https://github.com/Rdatatable/data.table/issues/2287), [#2299](https://github.com/Rdatatable/data.table/issues/2299), [#2285](https://github.com/Rdatatable/data.table/issues/2285), [#2251](https://github.com/Rdatatable/data.table/issues/2251), [#2347](https://github.com/Rdatatable/data.table/issues/2347), [#2222](https://github.com/Rdatatable/data.table/issues/2222), [#2352](https://github.com/Rdatatable/data.table/issues/2352), [#2246](https://github.com/Rdatatable/data.table/issues/2246), [#2370](https://github.com/Rdatatable/data.table/issues/2370), [#2371](https://github.com/Rdatatable/data.table/issues/2371), [#2404](https://github.com/Rdatatable/data.table/issues/2404), [#2196](https://github.com/Rdatatable/data.table/issues/2196), [#2322](https://github.com/Rdatatable/data.table/issues/2322), [#2453](https://github.com/Rdatatable/data.table/issues/2453), [#2446](https://github.com/Rdatatable/data.table/issues/2446)
* Many thanks to @yaakovfeldman, Guillermo Ponce, Arun Srinivasan, Hugh Parsonage, Mark Klik, Pasha Stetsenko, Mahyar K, Tom Crockett, @cnoelke, @qinjs, @etienne-s, Mark Danese, Avraham Adler, @franknarf1, @MichaelChirico for testing before release to CRAN: [#2070](https://github.com/Rdatatable/data.table/issues/2070), [#2073](https://github.com/Rdatatable/data.table/issues/2073), [#2087](https://github.com/Rdatatable/data.table/issues/2087), [#2091](https://github.com/Rdatatable/data.table/issues/2091), [#2107](https://github.com/Rdatatable/data.table/issues/2107), [fst#50](https://github.com/fstpackage/fst/issues/50#issuecomment-294287846), [#2118](https://github.com/Rdatatable/data.table/issues/2118), [#2092](https://github.com/Rdatatable/data.table/issues/2092), [#1888](https://github.com/Rdatatable/data.table/issues/1888), [#2123](https://github.com/Rdatatable/data.table/issues/2123), [#2167](https://github.com/Rdatatable/data.table/issues/2167), [#2194](https://github.com/Rdatatable/data.table/issues/2194), [#2238](https://github.com/Rdatatable/data.table/issues/2238), [#2228](https://github.com/Rdatatable/data.table/issues/2228), [#1464](https://github.com/Rdatatable/data.table/issues/1464), [#2201](https://github.com/Rdatatable/data.table/issues/2201), [#2287](https://github.com/Rdatatable/data.table/issues/2287), [#2299](https://github.com/Rdatatable/data.table/issues/2299), [#2285](https://github.com/Rdatatable/data.table/issues/2285), [#2251](https://github.com/Rdatatable/data.table/issues/2251), [#2347](https://github.com/Rdatatable/data.table/issues/2347), [#2222](https://github.com/Rdatatable/data.table/issues/2222), [#2352](https://github.com/Rdatatable/data.table/issues/2352), [#2246](https://github.com/Rdatatable/data.table/issues/2246), [#2370](https://github.com/Rdatatable/data.table/issues/2370), [#2371](https://github.com/Rdatatable/data.table/issues/2371), [#2404](https://github.com/Rdatatable/data.table/issues/2404), [#2196](https://github.com/Rdatatable/data.table/issues/2196), [#2322](https://github.com/Rdatatable/data.table/issues/2322), [#2453](https://github.com/Rdatatable/data.table/issues/2453), [#2446](https://github.com/Rdatatable/data.table/issues/2446), [#2464](https://github.com/Rdatatable/data.table/issues/2464)

2. `fwrite()`:
* empty strings are now always quoted (`,"",`) to distinguish them from `NA` which by default is still empty (`,,`) but can be changed using `na=` as before. If `na=` is provided and `quote=` is the default `'auto'` then `quote=` is set to `TRUE` so that if the `na=` value occurs in the data, it can be distinguished from `NA`. Thanks to Ethan Welty for the request [#2214](https://github.com/Rdatatable/data.table/issues/2214) and Pasha for the code change and tests, [#2215](https://github.com/Rdatatable/data.table/issues/2215).
Expand Down
30 changes: 28 additions & 2 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -10759,7 +10759,7 @@ test(1779.8, identical(as.IDate.default(p), as.IDate(p)))

# R 3.0.1 had the following bug fix in R News :
# " as.POSIXct.numeric was coercing origin using the tz argument and not "GMT" as documented (PR#14973) "
# So tz="UTC" is required on next line for test 1779.9 to pass R 3.0.0 (current stated dependency).
# So tz="UTC" is required on next line for test 1779.9 to pass R 3.0.0 (current stated dependency).
# Test 1779.9 would be fine without the tz="UTC" from R 3.0.1 onwards.
p = as.POSIXct(i, origin = "1970-01-01", tz="UTC")
test(1779.9, identical(as.ITime.default(p), as.ITime(p)))
Expand Down Expand Up @@ -11085,7 +11085,33 @@ DT2 <- fread("issue_2275_dt2.csv")[, DT2_ID := .I][, (cols) := lapply(.SD, as.D
ans1 <- DT2[DT1, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L
tmp <- DT1[DT2, on=.(RANDOM_STRING, DATE >= START_DATE, DATE <= EXPIRY_DATE), which=TRUE, nomatch=0L]
ans2 <- DT1[, DT1_ID %in% tmp]
test(1850.1, ans1, ans2)
test(1848, ans1, ans2)


# when last field is quoted contains sep and select= is used too, #2464
test(1849.1, fread('Date,Description,Amount,Balance\n20150725,abcd,"$3,004","$5,006"', select=c("Date", "Description", "Amount")),
data.table(Date=20150725L,Description="abcd",Amount="$3,004"))
test(1849.2, fread('Date,Description,Amount,Balance\n20150725,abcd,"$3,004","$5,006"', select=c("Date", "Description", "Balance")),
data.table(Date=20150725L,Description="abcd",Balance="$5,006"))
test(1849.3, fread('Date,Description,Amount,Balance\n20150725,abcd,"$3,004","$5,006"\n', select=c("Date", "Description", "Amount")),
data.table(Date=20150725L,Description="abcd",Amount="$3,004"))
test(1849.4, fread('Date,Description,Amount,Balance\n20150725,abcd,"$3,004","$5,006"\n', select=c("Date", "Description", "Balance")),
data.table(Date=20150725L,Description="abcd",Balance="$5,006"))
cat('Date,Description,Amount,Balance\n20150725,abcd,"$3,004","$5,006"', file=f<-tempfile())
test(1849.5, fread(f,verbose=TRUE),
data.table(Date=20150725L,Description="abcd",Amount="$3,004", Balance="$5,006"),
output="File ends abruptly with '\"'")
test(1849.6, fread(f, select=c("Date", "Description", "Amount")),
data.table(Date=20150725L,Description="abcd",Amount="$3,004"))
test(1849.7, fread(f, select=c("Date", "Description", "Balance")),
data.table(Date=20150725L,Description="abcd",Balance="$5,006"))
cat('\n', file=f, append=TRUE)
test(1849.8, fread(f, select=c("Date", "Description", "Amount")),
data.table(Date=20150725L,Description="abcd",Amount="$3,004"))
test(1849.9, fread(f, select=c("Date", "Description", "Balance")),
data.table(Date=20150725L,Description="abcd",Balance="$5,006"))
unlink(f)


##########################

Expand Down
11 changes: 7 additions & 4 deletions src/fread.c
Original file line number Diff line number Diff line change
Expand Up @@ -541,7 +541,10 @@ static void Field(FieldParseContext *ctx)
*(ctx->ch) = ch;
} else {
*(ctx->ch) = ch;
if (*ch=='\0' && quoteRule!=2) { target->off--; target->len++; } // test 1324 where final field has open quote but not ending quote; include the open quote like quote rule 2
if (*ch=='\0') {
if (finalByte==quote && ch==eof) return; // test 1849.* for issue 2464 where final quoted field contains a sep and no \n ending the file
if (quoteRule!=2) { target->off--; target->len++; } // test 1324 where final field has open quote but not ending quote; include the open quote like quote rule 2
}
if (stripWhite) while(target->len>0 && ch[-1]==' ') { target->len--; ch--; } // test 1551.6; trailing whitespace in field [67,V37] == "\"\"A\"\" ST "
}
}
Expand Down Expand Up @@ -1989,9 +1992,9 @@ int freadMain(freadMainArgs _args) {
if (j==ncol) { tch++; myNrow++; continue; } // next line. Back up to while (tch<nextJump). Usually happens, fastest path
}
else {
tch = fieldStart; // restart field as int processor could have moved to A in ",123A,", or we could be on \0 when we always reread last field regardless
tch = fieldStart; // restart field as int processor could have moved to A in ",123A,"
}
// if *tch=='\0' then fall through below and reread final field if finalByte is set
// if *tch=='\0' then *eof in mind, fall through to below and, if finalByte is set, reread final field
}
//*** END TEPID. NOW COLD.

Expand All @@ -2016,7 +2019,7 @@ int freadMain(freadMainArgs _args) {
while (absType < NUMTYPE) {
tch = fieldStart;
bool quoted = false;
if (absType < CT_STRING) {
if (absType<CT_STRING && absType>CT_DROP/*Field() too*/) {
skip_white(&tch);
const char *afterSpace = tch;
tch = end_NA_string(fieldStart);
Expand Down

0 comments on commit 1b8e886

Please sign in to comment.