
Do not suggest fread fill=TRUE if already used #2727

Closed
jangorecki opened this issue Apr 5, 2018 · 16 comments · Fixed by #5119 or #6203

Comments

@jangorecki
Member

Current fread behavior:

dt = fread("Rprofmem.out", header=FALSE)
# Warning message:
# In fread("Rprofmem.out", header = FALSE) :
#   Stopped early on line 2. Expected 3 fields but found 2. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<new page:"library" >>
dt = fread("Rprofmem.out", header=FALSE, fill=TRUE)
# Warning message:
# In fread("Rprofmem.out", header = FALSE, fill = TRUE) :
#   Stopped early on line 90. Expected 7 fields but found 8. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<new page:"close" "readRDS" "FUN" "lapply" "find.package" "system.file" "library" >>

Before printing this warning we should check whether fill=TRUE was already used.

@st-pasha
Contributor

st-pasha commented Apr 5, 2018

If fill=TRUE then fread should add new columns as necessary. And then no error will be emitted.

@jangorecki
Member Author

jangorecki commented Apr 6, 2018

Rprofmem.out.zip
There are other issues in this file, so don't try too hard to parse it; I'm attaching it so the issue can be reproduced.
The profmem package also fails to parse it.

@lotard

lotard commented Dec 11, 2018

+1 to this, which can be reproduced using this simple toy example:

> body = paste0(rep("1 2\n", 1000), collapse="")
> main = paste0(body, "1 2 3\n", body, collapse="")
> fread(main, fill=T)
Warning message:
In fread(main, fill=T) :
  Stopped early on line 1001. Expected 2 fields but found 3.
  Consider fill=TRUE and comment.char=. First discarded non-empty line: <<1 2 3>>

The issue seems to be incomplete sampling of the possible number of columns. I don't necessarily think max(count.fields(file)) is the right solution here (it is computationally expensive), but allowing the user to force the number of columns if they know it (e.g. fill=3 in my example) would help.

@randomgambit

I have the same issue actually. Is there a workaround for now?
Thanks!

@lotard

lotard commented Aug 1, 2019

Find out the max number of fields, create a 'fake' line with that many fields at the beginning of the file, read the file, then drop that line.
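A minimal base-R sketch of this workaround (the toy file and its contents are placeholders; `count.fields` is base R, and the `fread` call assumes data.table is installed):

```r
# Toy file standing in for the real data: the widest row has 3 fields.
path <- tempfile(fileext = ".txt")
writeLines(c("1 2", "1 2", "1 2 3"), path)

# count.fields() (base R) returns the field count of every line,
# so its max is the number of columns in the widest row.
max_fields <- max(count.fields(path, sep = " "))

# Prepend a dummy line with that many fields so fread's sampler
# sees the full width on line 1.
patched <- tempfile(fileext = ".txt")
writeLines(c(paste(rep("0", max_fields), collapse = " "),
             readLines(path)), patched)

# Read the patched file, then drop the dummy first row.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::fread(patched, header = FALSE, fill = TRUE)
  dt <- dt[-1]
}
```

Note that `count.fields` makes a full extra pass over the file, which is exactly the cost the sampling heuristic tries to avoid.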

@Befrancesco

I have the same issue, but my CSV is 3.5 GB.
Do I have to modify it with PowerShell, or is there another way?

Thank you in advance.

@gleesonger

gleesonger commented Jun 19, 2020

Having the same issue. Is there any workaround that doesn't involve patching the start of the file contents?


@andreas-sudo

I have the same issue.
In my case the file stops on line 1384.
It is a file I read from the web analytics API of Matomo, so it is not corrupted as such :-)

@katiemarker

katiemarker commented May 11, 2023

I'm still getting this same issue (May 2023); it looks like there is a pull request to fix this, but it hasn't been merged. If any of your rows are longer than the longest in the sample it takes (maybe the first 100 rows?), it gives the warning below and stops reading at that line, even when fill = TRUE is already used. fread throws an error if you try to supply an integer as the ncol guess.

Warning message:
In fread("file", :
Stopped early on line 165. Expected 34 fields but found 35. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<808,615,3261,608,1755,518,3469,3498,6624,495,317,6687,889,282,610,353,235,247,281,341,503,680,796,1012,6254,585,652,857,579,104,1771,859,881,768,1957>>

Any idea of when this fix will get implemented? I'm using data.table v1.14.8

Here is the verbose read out if that is helpful:

test2 <- fread("file", header = F, stringsAsFactors = F, fill = T, verbose = T, sep = ",")

This installation of data.table has not been compiled with OpenMP support.
omp_get_num_procs() 1
R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
R_DATATABLE_NUM_THREADS unset
R_DATATABLE_THROTTLE unset (default 1024)
omp_get_thread_limit() 1
omp_get_max_threads() 1
OMP_THREAD_LIMIT unset
OMP_NUM_THREADS unset
RestoreAfterFork true
data.table is using 1 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open [01]
Check arguments
Using 1 threads (omp_get_max_threads()=1, nth=1)
NAstrings = [<>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file file
File opened, size = 42.86KB (43893 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<3818,3076,3273>>
[06] Detect separator, quoting rule, and ncolumns
Using supplied sep ',' sep=',' with 34 fields using quote rule 0
Detected 3 columns on line 1. This line is either column names or first data row. Line starts as: <<3818,3076,3273>>
Quote rule picked = 0
fill=true and the most number of columns found is 34
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 1 because (43892 bytes from row 1 to eof) / (2 * 2400 jump0size) == 9
Type codes (jump 000) : 5555555555555555555555555555555555
Quote rule 0 Type codes (jump 001) : 5555555555555555555555555555555555
Quote rule 0 ===== Sampled 200 rows (handled \n inside quoted fields) at 2 jump points
Bytes from first data row on line 1 to the end of last row: 43892
Line length: mean=15.57 sd=18.32 min=4 max=143
Estimated number of rows: 43892 / 15.57 = 2819
Initial alloc = 5638 rows (2819 + 100%) using bytes/max(mean-2sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 5555555555555555555555555555555555
[10] Allocate memory for the datatable
Allocating 34 column slots (34 - 0 dropped) with 5638 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=43892
Restarting team from jump 0. nSwept==0 quoteRule==1
  jumps=[0..1), chunk_size=1048576, total_size=43892
Restarting team from jump 0. nSwept==0 quoteRule==2
  jumps=[0..1), chunk_size=1048576, total_size=43892
Restarting team from jump 0. nSwept==0 quoteRule==3
  jumps=[0..1), chunk_size=1048576, total_size=43892
Read 164 rows x 34 columns from 42.86KB (43893 bytes) file in 00:00.074 wall clock time
[12] Finalizing the datatable
  Type counts:
        34 : int32 '5'
=============================
   0.074s ( 99%) Memory map 0.000GB file
   0.000s (  0%) sep=',' ncol=34 and header detection
   0.000s (  0%) Column type detection using 200 sample rows
   0.000s (  0%) Allocation of 5638 rows x 34 cols (0.001GB) of which 164 (  3%) rows used
   0.000s (  0%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 164 rows) using 1 threads
   +    0.000s (  0%) Parse to row-major thread buffers (grown 0 times)
   +    0.000s (  0%) Transpose
   +    0.000s (  0%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.074s        Total

@ben-schwen ben-schwen added this to the 1.16.0 milestone Jan 5, 2024
@jangorecki
Member Author

I wouldn't say it is closed. The behavior observed on the initially reported issue against current master is far from ideal: it now suggests using fill=10, but then you get the error again, suggesting fill=11.
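The escalating suggestions can be scripted around with a retry loop; a hedged sketch, not part of data.table (`reader` and `read_with_growing_fill` are hypothetical names — on a version where `fill` accepts an integer, per #5119, `reader` would be something like `function(n) fread(path, fill = n)`):

```r
# Keep bumping the column guess until the reader succeeds.
# A warning or error from the reader counts as a failed attempt.
read_with_growing_fill <- function(reader, start = 2L, max_tries = 200L) {
  for (ncol_guess in seq(start, length.out = max_tries)) {
    res <- tryCatch(reader(ncol_guess),
                    warning = function(w) NULL,
                    error   = function(e) NULL)
    if (!is.null(res)) return(res)
  }
  stop("no fill value up to ", start + max_tries - 1L, " worked")
}
```

This trades up to `max_tries` re-reads of the file for not having to know the true column count up front, so it is only a stopgap.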

@jangorecki jangorecki reopened this Mar 21, 2024
@ben-schwen
Member

ben-schwen commented Mar 21, 2024

True, it keeps raising the suggestion until fill=130L is reached, at which point it finally reads the file. Setting the guess higher won't help, because guessing 130 columns from 8 found ones seems like a real corner case. Should we add an option to not sample but read the full file for estimates like sep, columns, etc.?

@jangorecki
Member Author

Yes, usually people will be happy to have their files loaded, not necessarily in the fastest possible way. Then maybe fill=Inf?

@MichaelChirico
Member

fill=INT_MAX will already work after #5119, right? So then the easiest solution is just to look for is.infinite(fill) and replace it with .Machine$integer.max at the R level...
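A minimal sketch of that R-level mapping (`normalize_fill` is a hypothetical helper, not data.table's actual code):

```r
# Translate fill=Inf into the largest representable integer before
# handing it to the C reader; leave every other value untouched.
normalize_fill <- function(fill) {
  if (is.numeric(fill) && length(fill) == 1L && is.infinite(fill)) {
    return(.Machine$integer.max)
  }
  fill
}
```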

@ben-schwen
Member

> fill=INT_MAX will already work after #5119, right? So then the easiest solution is just to look for is.infinite(fill) and replace it with .Machine$integer.max at the R level...

Sounds good in theory, but unfortunately it allocates 2^31 columns of 8 bytes each and kills the R process 😄
