
Do not suggest fread fill=TRUE if already used #2727

Closed
jangorecki opened this issue Apr 5, 2018 · 16 comments · Fixed by #5119 or #6203

Comments

@jangorecki
Member

Current fread behavior:

dt = fread("Rprofmem.out", header=FALSE)
# Warning message:
# In fread("Rprofmem.out", header = FALSE) :
#   Stopped early on line 2. Expected 3 fields but found 2. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<new page:"library" >>
dt = fread("Rprofmem.out", header=FALSE, fill=TRUE)
# Warning message:
# In fread("Rprofmem.out", header = FALSE, fill = TRUE) :
#   Stopped early on line 90. Expected 7 fields but found 8. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<new page:"close" "readRDS" "FUN" "lapply" "find.package" "system.file" "library" >>

Before printing this warning we should check whether fill=TRUE was already used.

@st-pasha
Contributor

st-pasha commented Apr 5, 2018

If fill=TRUE then fread should add new columns as necessary. And then no error will be emitted.

@jangorecki
Member Author

jangorecki commented Apr 6, 2018

Rprofmem.out.zip
There are other issues in this file, so don't try too hard to parse it; I'm attaching it so the issue can be reproduced.
The profmem package also fails to parse it.

@lotard

lotard commented Dec 11, 2018

+1 to this, which can be reproduced using this simple toy example:

> body = paste0(rep("1 2\n", 1000), collapse="")
> main = paste0(body, "1 2 3\n", body, collapse="")
> fread(main, fill=T)
Warning message:
In fread(main, fill=T) :
  Stopped early on line 1001. Expected 2 fields but found 3.
  Consider fill=TRUE and comment.char=. First discarded non-empty line: <<1 2 3>>

The issue seems to be incomplete sampling of the possible number of columns. I don't necessarily think max(count.fields(file)) is the right solution here (it is computationally expensive), but allowing the user to force the number of columns if they know it (e.g. fill=3 in my example) would help.

@randomgambit

I have the same issue actually. Is there a workaround for now?
Thanks!

@lotard

lotard commented Aug 1, 2019

Find out the max number of fields, create a 'fake' line with that many fields at the beginning of the file, read the file, then drop that line.
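A minimal base-R sketch of this workaround (the toy file and its contents are placeholders; `count.fields` is base R, and the `fread` call assumes data.table is installed):

```r
# Toy file standing in for the real data: the widest row has 3 fields.
path <- tempfile(fileext = ".txt")
writeLines(c("1 2", "1 2", "1 2 3"), path)

# count.fields() (base R) returns the field count of every line,
# so its max is the number of columns in the widest row.
max_fields <- max(count.fields(path, sep = " "))

# Prepend a dummy line with that many fields so fread's sampler
# sees the full width on line 1.
patched <- tempfile(fileext = ".txt")
writeLines(c(paste(rep("0", max_fields), collapse = " "),
             readLines(path)), patched)

# Read the patched file, then drop the dummy first row.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::fread(patched, header = FALSE, fill = TRUE)
  dt <- dt[-1]
}
```

Note that `count.fields` makes a full extra pass over the file, which is exactly the cost the sampling heuristic tries to avoid.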

@Befrancesco

I have the same issue, but my CSV is 3.5 GB.
Do I have to modify it with PowerShell, or is there another way?

Thank you in advance.

@gleesonger

gleesonger commented Jun 19, 2020

Having the same issue. Is there any workaround that doesn't involve patching the start of the file contents?


@andreas-sudo

I have the same issue.
In my case the file stops on line 1384.
It is a file I read from the web analytics API of Matomo, so it is not corrupted as such :-)

@katiemarker

katiemarker commented May 11, 2023

I'm still getting this same issue (May 2023); it looks like there is a pull request to fix this, but it hasn't been merged. If any of your rows are longer than the longest in the sample it takes (maybe the first 100 rows?), it gives the warning below and stops reading at that line, even when fill = TRUE is already used. fread throws an error if you try to supply an integer as the ncol guess.

Warning message:
In fread("file", :
Stopped early on line 165. Expected 34 fields but found 35. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<808,615,3261,608,1755,518,3469,3498,6624,495,317,6687,889,282,610,353,235,247,281,341,503,680,796,1012,6254,585,652,857,579,104,1771,859,881,768,1957>>

Any idea of when this fix will get implemented? I'm using data.table v1.14.8

Here is the verbose read out if that is helpful:

test2 <- fread("file", header = F, stringsAsFactors = F, fill = T, verbose = T, sep = ",")

This installation of data.table has not been compiled with OpenMP support.
omp_get_num_procs() 1
R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
R_DATATABLE_NUM_THREADS unset
R_DATATABLE_THROTTLE unset (default 1024)
omp_get_thread_limit() 1
omp_get_max_threads() 1
OMP_THREAD_LIMIT unset
OMP_NUM_THREADS unset
RestoreAfterFork true
data.table is using 1 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open [01]
Check arguments
Using 1 threads (omp_get_max_threads()=1, nth=1)
NAstrings = [<>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file file
File opened, size = 42.86KB (43893 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<3818,3076,3273>>
[06] Detect separator, quoting rule, and ncolumns
Using supplied sep ',' sep=',' with 34 fields using quote rule 0
Detected 3 columns on line 1. This line is either column names or first data row. Line starts as: <<3818,3076,3273>>
Quote rule picked = 0
fill=true and the most number of columns found is 34
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 1 because (43892 bytes from row 1 to eof) / (2 * 2400 jump0size) == 9
Type codes (jump 000) : 5555555555555555555555555555555555
Quote rule 0 Type codes (jump 001) : 5555555555555555555555555555555555
Quote rule 0 ===== Sampled 200 rows (handled \n inside quoted fields) at 2 jump points
Bytes from first data row on line 1 to the end of last row: 43892
Line length: mean=15.57 sd=18.32 min=4 max=143
Estimated number of rows: 43892 / 15.57 = 2819
Initial alloc = 5638 rows (2819 + 100%) using bytes/max(mean-2sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 5555555555555555555555555555555555
[10] Allocate memory for the datatable
Allocating 34 column slots (34 - 0 dropped) with 5638 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=43892
Restarting team from jump 0. nSwept==0 quoteRule==1
  jumps=[0..1), chunk_size=1048576, total_size=43892
Restarting team from jump 0. nSwept==0 quoteRule==2
  jumps=[0..1), chunk_size=1048576, total_size=43892
Restarting team from jump 0. nSwept==0 quoteRule==3
  jumps=[0..1), chunk_size=1048576, total_size=43892
Read 164 rows x 34 columns from 42.86KB (43893 bytes) file in 00:00.074 wall clock time
[12] Finalizing the datatable
  Type counts:
        34 : int32 '5'
=============================
   0.074s ( 99%) Memory map 0.000GB file
   0.000s (  0%) sep=',' ncol=34 and header detection
   0.000s (  0%) Column type detection using 200 sample rows
   0.000s (  0%) Allocation of 5638 rows x 34 cols (0.001GB) of which 164 (  3%) rows used
   0.000s (  0%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 164 rows) using 1 threads
   +    0.000s (  0%) Parse to row-major thread buffers (grown 0 times)
   +    0.000s (  0%) Transpose
   +    0.000s (  0%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.074s        Total

@ben-schwen ben-schwen added this to the 1.16.0 milestone Jan 5, 2024
@jangorecki
Member Author

I wouldn't say it is closed. The behavior observed on the initially reported issue against current master is far from ideal: it now suggests using fill=10, but then you get the error again, suggesting fill=11.
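The escalating suggestions can be scripted around with a retry loop; a hedged sketch, not part of data.table (`reader` and `read_with_growing_fill` are hypothetical names — on a version where `fill` accepts an integer, per #5119, `reader` would be something like `function(n) fread(path, fill = n)`):

```r
# Keep bumping the column guess until the reader succeeds.
# A warning or error from the reader counts as a failed attempt.
read_with_growing_fill <- function(reader, start = 2L, max_tries = 200L) {
  for (ncol_guess in seq(start, length.out = max_tries)) {
    res <- tryCatch(reader(ncol_guess),
                    warning = function(w) NULL,
                    error   = function(e) NULL)
    if (!is.null(res)) return(res)
  }
  stop("no fill value up to ", start + max_tries - 1L, " worked")
}
```

This trades up to `max_tries` re-reads of the file for not having to know the true column count up front, so it is only a stopgap.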

@jangorecki jangorecki reopened this Mar 21, 2024
@ben-schwen
Member

ben-schwen commented Mar 21, 2024

True, it keeps raising the suggestion until fill=130L is reached, at which point it finally reads the file. Setting the guess higher won't help, because guessing 130 columns from 8 found ones seems like a real corner case. Should we add an option to not sample but read the full file for estimates like sep, columns, etc.?

@jangorecki
Member Author

Yes, usually people will be happy to have their files loaded, not necessarily in the fastest possible way. Then maybe fill=Inf?

@MichaelChirico
Member

fill=INT_MAX will already work after #5119, right? So then the easiest solution is just to look for is.infinite(fill) and replace it with .Machine$integer.max at the R level...
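A minimal sketch of that R-level mapping (`normalize_fill` is a hypothetical helper, not data.table's actual code):

```r
# Translate fill=Inf into the largest representable integer before
# handing it to the C reader; leave every other value untouched.
normalize_fill <- function(fill) {
  if (is.numeric(fill) && length(fill) == 1L && is.infinite(fill)) {
    return(.Machine$integer.max)
  }
  fill
}
```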

@ben-schwen
Member

> fill=INT_MAX will already work after #5119, right? So then the easiest solution is just to look for is.infinite(fill) and replace it with .Machine$integer.max at the R level...

Sounds good in theory, but unfortunately it allocates 2^31 columns of 8 bytes each and kills the R process 😄
