
[Bug] fwrite in large environments (tables up to 100M rows) #1968

Closed
mgahan opened this issue Dec 21, 2016 · 9 comments


mgahan commented Dec 21, 2016

I have been using fwrite in a large AWS environment detailed below:

Instance: m4.4xlarge
RAM: 64 GB
Threads: 16
Cost: $0.862 hourly

I notice that when writing out numeric values, the written output is incorrect. When the same data is coerced to integer, the output is correct. The problem does not occur with setDTthreads(1), but errors start to creep in with setDTthreads(2), albeit fewer than with setDTthreads(16). The two scenarios are detailed below.
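For a quicker check of the thread dependence described above, a minimal round-trip sketch along these lines can be used (not part of the original report; it uses a smaller N and a hypothetical roundtrip.csv file):

library(data.table)
N <- 1e7
dt <- data.table(ID = 1:N, V1 = rpois(N, N) + 0.0)  # force V1 to numeric

for (th in c(1, 2, 16)) {
  setDTthreads(th)
  fwrite(dt, "roundtrip.csv")
  back <- fread("roundtrip.csv")
  # whole-number values should survive the round trip exactly
  cat("threads =", th, "; mismatches =", sum(back$V1 != dt$V1), "\n")
}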

Writing numeric output with fwrite

# Set assumptions
library(data.table) # data.table 1.10.0
# Also tested with the 1.10.1 in-development build
# of 2016-12-09 03:17:34 UTC
N=1e8
set.seed(1)

# Make test data
check <- data.table(ID=1:N, Time=0:49)
check[, paste0("V",1:5) := lapply(rep(N, 5), rpois, .N)]
sapply(check, class)

# Modify type of test data
check[, names(check) := lapply(.SD, function(x) x+0.1)]
sapply(check, class)
check[, names(check) := lapply(.SD, function(x) x-0.1)]

# Write data
fwrite(check, "check.csv")
rm(check)
gc()

# Read written data
checkWrite <- fread("check.csv")
sapply(checkWrite, class)
	
# Tests
checkWrite[, .N, by=.(Time)]
	
#           Time       N
#    1:        0 2000000
#    2:        1 1986674
#    3:        2 1985449
#    4:        3 1984278
#    5:        4 1984690
#   ---                 
#44725: 99983607       1
#44726: 99980978       1
#44727: 99646303       1
#44728: 99649539       1
#44729: 99653656       1
	
checkWrite[V1 != round(V1,0)]

#             ID Time           V1        V2        V3        V4        V5
#    1:      449   48  97664.38965  99985524  99998767 100009754 100010494
#    2:     2509    8  97643.65527  99996565  99999968  99992995  99974462
#    3:     2751    0     23.84322  99996050 100007735 100008407 100005063
#    4:     6701    0     23.84024  99998435  99993597 100000237  99999963
#    5:     8101    0 781300.35938  99991390  99998825 100005920  99994596
#   ---                                                                   
#22599: 99968305    4     47.67970  99988021 100012376  99996538 100012572
#22600: 99970551    0     23.84325  99996674  99995685 100014894 100013937
#22601: 99973932   31     11.92025 100012133  99990838  99993337  99984441
#22602: 99975660    9     47.68194  99985576  99992718  99982681  99991041
#22603: 99977051    0     47.68710  99998134 100004288 100007002  99990199
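A quick way to quantify the corruption across all columns (a sketch, assuming checkWrite from the code above is still in memory):

# every value was generated as a whole number, so count non-integer entries
sapply(checkWrite, function(x) sum(x != round(x)))

# Time was generated as 0:49, so anything outside that range is corrupt
checkWrite[!Time %in% 0:49, .N]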

Writing integer output with fwrite

# Set assumptions
library(data.table) # data.table 1.10.0
N=1e8
set.seed(1)

# Make test data
check <- data.table(ID=1:N, Time=0:49)
check[, paste0("V",1:5) := lapply(rep(N, 5), rpois, .N)]
sapply(check, class)

# Modify type of test data
check[, names(check) := lapply(.SD, function(x) x+0.1)]
sapply(check, class)
check[, names(check) := lapply(.SD, function(x) x-0.1)]

# Convert classes back to integer
check[, names(check) := lapply(.SD, as.integer)]
sapply(check, class)

# Write data
fwrite(check, "check.csv")
rm(check)
gc()
	
# Read written data
checkWrite <- fread("check.csv")
sapply(checkWrite, class)
	
# Tests
checkWrite[, .N, by=.(Time)]
	
#    Time       N
# 1:    0 2000000
# 2:    1 2000000
# 3:    2 2000000
# 4:    3 2000000
# 5:    4 2000000
#   ---
# 46:   45 2000000
# 47:   46 2000000
# 48:   47 2000000
# 49:   48 2000000
# 50:   49 2000000
	
checkWrite[V1 != round(V1,0)]
	
# Empty data.table (0 rows) of 7 cols: ID,Time,V1,V2,V3,V4...

jmosser commented Dec 23, 2016

I have the same problem, reproducible when multiple threads are used and for numeric values only (integers seem to work fine), as @mgahan wrote. Interestingly, it appears to be isolated to cases where there is more than one column in the data.table:

library(data.table) # data.table 1.10.0
set.seed(42)
n <- 250000 # number of rows
setDTthreads(4)

# With single variable alone ------------------------------

dt <- data.table(as.numeric(sample(0:1, n, replace = TRUE)))
names(dt) <- "zero_one"

fwrite(dt, "fwrite.csv")
dt_fwrite <- fread("fwrite.csv")

write.csv(dt, "writecsv.csv", row.names = FALSE)
dt_writecsv <- fread("writecsv.csv")

> unique(dt$zero_one)
[1] 1 0
> unique(dt_fwrite$zero_one)
[1] 1 0
> unique(dt_writecsv$zero_one)
[1] 1 0

# With additional numeric variable -----------------------

dt <- data.table(runif(n))
dt[, zero_one := as.numeric(sample(0:1, n, replace = TRUE))]

fwrite(dt, "fwrite.csv")
dt_fwrite <- fread("fwrite.csv")

write.csv(dt, "writecsv.csv", row.names = FALSE)
dt_writecsv <- fread("writecsv.csv")

> unique(dt$zero_one)
[1] 0 1
> unique(dt_fwrite$zero_one)
[1] 0.000000000 1.000000000 0.062500000 0.001953125 0.250000000 0.125000000
[7] 0.500000000
> unique(dt_writecsv$zero_one)
[1] 0 1
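A sketch (not from the original comment) that loops over thread counts to find where the corruption first appears, reusing the two-column case above:

library(data.table)
set.seed(42)
n <- 250000
dt <- data.table(runif(n), zero_one = as.numeric(sample(0:1, n, replace = TRUE)))

for (th in 1:4) {
  setDTthreads(th)
  fwrite(dt, "fwrite.csv")
  back <- fread("fwrite.csv")
  # compare values only, so an integer/double change on re-read is tolerated
  cat("threads =", th, "; values match:", all(back$zero_one == dt$zero_one), "\n")
}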

@mattdowle mattdowle added this to the v1.10.2 milestone Jan 24, 2017
mattdowle (Member) commented

@mgahan @jmosser Great find, thanks for reporting. Would you mind testing whether it's OK now, please?


mgahan commented Jan 25, 2017

@mattdowle I just tested it, and everything checks out. Thanks for all the hard work!

mattdowle (Member) commented

@mgahan Excellent - thanks!

@mattdowle mattdowle modified the milestones: Candidate, v1.10.2 Jan 31, 2017
jfrostad commented

@mattdowle @mgahan Which version of the package has this bug fix?

jfrostad commented

@mattdowle I am not sure this issue is resolved. I am seeing it in data.table v1.10.4 (repeated write/read cycles below; a sketch automating them follows the sessionInfo output):

Browse[1]> nrow(dt[population<1])
[1] 0
Browse[1]> nrow(newdt[population<1])
[1] 53

Browse[1]> fwrite(dt, file.path(worker.dir, paste0(country, '.csv')))
Written 36.9% of 8140 rows in 3 secs using 40 threads. anyBufferGrown=no; maxBuffUsed=48%. Finished in 5 secs. Written 74.7% of 8140 rows in 4 secs using 40 threads. anyBufferGrown=no; maxBuffUsed=48%. Finished in 1 secs.

Browse[1]> newdt <- fread(file.path(worker.dir, paste0(country, '.csv')))
Read 8140 rows and 3022 (of 3022) columns from 0.401 GB file in 00:00:08

Browse[1]> nrow(newdt[population<1])
[1] 38

Browse[1]> fwrite(dt, file.path(worker.dir, paste0(country, '.csv')))
Written 11.4% of 8140 rows in 2 secs using 40 threads. anyBufferGrown=no; maxBuffUsed=48%. Finished in 15 secs.Written 49.2% of 8140 rows in 3 secs using 40 threads. anyBufferGrown=no; maxBuffUsed=48%. Finished in 3 secs.

Browse[1]> newdt <- fread(file.path(worker.dir, paste0(country, '.csv')))
Read 8140 rows and 3022 (of 3022) columns from 0.401 GB file in 00:00:08

Browse[1]> nrow(newdt[population<1])
[1] 32

Browse[1]> fwrite(dt, file.path(worker.dir, paste0(country, '.csv')))
Written 0.9% of 8140 rows in 2 secs using 40 threads. anyBufferGrown=no; maxBuffUsed=48%. Finished in 209 secs.Written 38.8% of 8140 rows in 3 secs using 40 threads. anyBufferGrown=no; maxBuffUsed=48%. Finished in 4 secs.

Browse[1]> newdt <- fread(file.path(worker.dir, paste0(country, '.csv')))
Read 8140 rows and 3022 (of 3022) columns from 0.401 GB file in 00:00:08

Browse[1]> nrow(newdt[population<1])
[1] 35

Browse[1]> fwrite(dt, file.path(worker.dir, paste0(country, '.csv')))
Written 0.9% of 8140 rows in 2 secs using 40 threads. anyBufferGrown=no; maxBuffUsed=48%. Finished in 209 secs.Written 38.8% of 8140 rows in 3 secs using 40 threads. anyBufferGrown=no; maxBuffUsed=48%. Finished in 4 secs.

Browse[1]> newdt <- fread(file.path(worker.dir, paste0(country, '.csv')))
Read 8140 rows and 3022 (of 3022) columns from 0.401 GB file in 00:00:08

Browse[1]> nrow(newdt[population<1])
[1] 26

Browse[1]> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] jsonlite_1.1 RMySQL_0.10.9 DBI_0.5-1 plyr_1.8.4
[5] dplyr_0.5.0 readxl_1.0.0 stringr_1.1.0 magrittr_1.5
[9] lme4_1.1-12 Matrix_1.2-7.1 ggplot2_2.2.0 gridExtra_2.2.1
[13] data.table_1.10.4

loaded via a namespace (and not attached):
[1] Rcpp_0.12.8 splines_3.3.2 MASS_7.3-45 munsell_0.4.3
[5] colorspace_1.3-1 lattice_0.20-34 R6_2.2.0 minqa_1.2.4
[9] tools_3.3.2 grid_3.3.2 gtable_0.2.0 nlme_3.1-128
[13] pacman_0.4.1 lazyeval_0.2.0 assertthat_0.1 tibble_1.2
[17] nloptr_1.0.4 stringi_1.1.2 cellranger_1.1.0 scales_0.4.1
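The manual cycles above can be automated along these lines (a sketch assuming the transcript's dt, worker.dir, and country objects; every count would be 0 if the round trip were lossless):

f <- file.path(worker.dir, paste0(country, ".csv"))
for (i in 1:5) {
  fwrite(dt, f)
  newdt <- fread(f)
  cat("run", i, ": rows with population < 1 =", nrow(newdt[population < 1]), "\n")
}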

MichaelChirico (Member) commented

The fix is not on CRAN yet. Install the current development version (1.10.5).
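For example, via devtools (a sketch; assumes devtools is installed and a compiler toolchain is available, since data.table builds from source):

devtools::install_github("Rdatatable/data.table")
packageVersion("data.table")  # should report 1.10.5 or later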

jfrostad commented

@MichaelChirico I don't think so. I just installed the dev version and see the same behavior:

Browse[1]> newdt <- fread(file.path(worker.dir, paste0(country, '.csv')))
Read 8140 rows and 3022 (of 3022) columns from 0.401 GB file in 00:00:08

Browse[1]> fwrite(dt, file.path(worker.dir, paste0(country, '.csv')))
Written 0.9% of 8140 rows in 2 secs using 40 threads. anyBufferGrown=no; maxBuffUsed=48%. Finished in 209 secs.Written 76.6% of 8140 rows in 3 secs using 40 threads. anyBufferGrown=no; maxBuffUsed=48%. Finished in 0 secs.

Browse[1]> newdt <- fread(file.path(worker.dir, paste0(country, '.csv')))

Browse[1]> nrow(dt[population<1])
[1] 0

Browse[1]> nrow(newdt[population<1])
[1] 26

Browse[1]> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] jsonlite_1.1 RMySQL_0.10.9 DBI_0.5-1 plyr_1.8.4
[5] dplyr_0.5.0 readxl_1.0.0 stringr_1.1.0 magrittr_1.5
[9] lme4_1.1-12 Matrix_1.2-7.1 ggplot2_2.2.0 gridExtra_2.2.1
[13] data.table_1.10.5

loaded via a namespace (and not attached):
[1] Rcpp_0.12.8 splines_3.3.2 MASS_7.3-45 munsell_0.4.3
[5] colorspace_1.3-1 lattice_0.20-34 R6_2.2.0 minqa_1.2.4
[9] tools_3.3.2 grid_3.3.2 gtable_0.2.0 nlme_3.1-128
[13] pacman_0.4.1 lazyeval_0.2.0 assertthat_0.1 tibble_1.2
[17] nloptr_1.0.4 stringi_1.1.2 cellranger_1.1.0 scales_0.4.1

jfrostad commented

This could be specific to our cluster environment, but others at my institute are seeing the same issue. It seems to be related to columns being accidentally reordered within a row.

I would recommend being very cautious about using fwrite in production code at this stage; this bug seems to be pervasive and is really difficult to track down in large outputs.
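Until then, a defensive pattern is to verify every round trip and fall back to write.csv on mismatch. A minimal sketch (safe_fwrite is a hypothetical helper, not data.table API):

library(data.table)

safe_fwrite <- function(dt, path) {
  fwrite(dt, path)
  back <- fread(path)
  # all.equal compares column by column and tolerates an integer/double
  # change on re-read while catching changed values
  ok <- isTRUE(all.equal(as.data.frame(dt), as.data.frame(back)))
  if (!ok) {
    warning("fwrite round trip mismatch, falling back to write.csv: ", path)
    write.csv(dt, path, row.names = FALSE)
  }
  invisible(ok)
}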
