fread feather #2026
I think a separate package like [...] Of course, feel free to re-open if you think otherwise @mattdowle :)
Yes, I agree in general.
I would really love to see more support for this. We see ourselves increasingly moving away from R simply because our preferred data-frame package (data.table) requires a `setDT()` call before we can work with parquet files. The way things are now, `fread`ing .csv files is faster than reading and converting parquet files, despite the latter being the superior format in every respect. In the age of delta lakes and similar modern data-science infrastructure this feels somewhat anachronistic, and it represents a significant bottleneck when working with large datasets.
I'm in a similar boat to @DrMaphuse. Our entire stack is parquet + arrow based, and data.table sits a little awkwardly alongside it. Given the architectural differences and how the data are represented in memory, my guess is that some conversion penalty (whether `setDT` or otherwise) is unavoidable. OTOH it would be great to be able to use data.table's [...]
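For reference, the two-step workaround being discussed looks like this (a minimal sketch; `data.parquet` is a placeholder path, not a file from this thread):

```r
library(arrow)
library(data.table)

# Read the parquet file into a data.frame/tibble via arrow, then convert
# in place; setDT() works by reference, so no extra copy of the data is made.
dt <- setDT(read_parquet("data.parquet"))
```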
I haven't used parquet before. When you mentioned [...]
No, the data is already in a `data.frame`. BUT, when writing a `data.table` with `arrow::write_parquet()`, reading it back gives a `data.table` again:

```r
> test_dt <- data.table(c(1, 2, 3, 4))
> test_df <- data.frame(c(1, 2, 3, 4))
> is.data.table(test_dt)
[1] TRUE
> is.data.table(test_df)
[1] FALSE
> test_dt %>% write_parquet(x = ., 'test_dt.parquet')
> test_df %>% write_parquet(x = ., 'test_df.parquet')
> test_dt <- read_parquet('test_dt.parquet')
> test_df <- read_parquet('test_df.parquet')
> is.data.table(test_dt)
[1] TRUE
> is.data.table(test_df)
[1] FALSE
```

However, this doesn't work when reading files that were created with other tools. I've been trying to figure out how this works, but I couldn't find anything in the files' metadata that would indicate a difference. It would be super cool if we could use this to generate `data.table`s directly when reading.
@DrMaphuse could you profile what part of [...]? Not using [...]
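One simple way to split the timing (a sketch; `big.parquet` is a hypothetical large file, not one from this thread):

```r
library(arrow)
library(data.table)

# Time the arrow read and the data.table conversion separately to see which
# step dominates; setDT() converts by reference and is normally very cheap.
system.time(df <- read_parquet("big.parquet"))
system.time(setDT(df))
```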
This is simply arrow writing back the R attributes when converting from Arrow to R. The same result can be reached by manually setting up the following procedure:

```r
> tbl <- data.frame(x = c(1, 2)) |> arrow::arrow_table()
> tbl$metadata$r$attributes$class <- c("data.table", "data.frame")
> arrow::write_parquet(tbl, "test.parquet")
> library(data.table)
data.table 1.14.6 using 8 threads (see ?getDTthreads). Latest news: r-datatable.com
> arrow::read_parquet("test.parquet")
   x
1: 1
2: 2
```

You can check the metadata of this file with pyarrow, for example:

```python
>>> import pyarrow.parquet
>>> md = pyarrow.parquet.read_metadata("test.parquet")
>>> md.metadata
{b'ARROW:schema': b'/////4gBAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAABQBAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAADAAAAAEAAAByAAAA4gAAAEEKMwoyNjI2NTgKMTk3ODg4CjUKVVRGLTgKNTMxCjIKNTMxCjEKMTYKMgoyNjIxNTMKMTAKZGF0YS50YWJsZQoyNjIxNTMKMTAKZGF0YS5mcmFtZQoxMDI2CjEKMjYyMTUzCjUKbmFtZXMKMTYKMQoyNjIxNTMKNQpjbGFzcwoyNTQKNTMxCjEKMjU0CjEwMjYKNTExCjE2CjEKMjYyMTUzCjEKeAoyNTQKMTAyNgo1MTEKMTYKMgoyNjIxNTMKMTAKYXR0cmlidXRlcwoyNjIxNTMKNwpjb2x1bW5zCjI1NAoAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEDEAAAABgAAAAEAAAAAAAAAAEAAAB4AAYACAAGAAYAAAAAAAIAAAAAAA==', b'r': b'A\n3\n262658\n197888\n5\nUTF-8\n531\n2\n531\n1\n16\n2\n262153\n10\ndata.table\n262153\n10\ndata.frame\n1026\n1\n262153\n5\nnames\n16\n1\n262153\n5\nclass\n254\n531\n1\n254\n1026\n511\n16\n1\n262153\n1\nx\n254\n1026\n511\n16\n2\n262153\n10\nattributes\n262153\n7\ncolumns\n254\n'}
```
As suggested here, to avoid needing to use or wrap with `setDT`: https://twitter.com/bennetvoorhees/status/830070242659414016

(I guess that rio returns a data.frame or tibble, so making fread do it is perhaps clearer, as people use fread to return a data.table.)
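A rough sketch of what this could look like as a user-level wrapper; `fread2` is a hypothetical name and the extension dispatch is an assumption, not data.table's actual API or plan:

```r
library(data.table)

# Hypothetical wrapper: dispatch on file extension, delegate columnar formats
# to arrow and convert by reference, fall back to fread for text formats.
fread2 <- function(path, ...) {
  if (grepl("\\.parquet$", path)) {
    setDT(arrow::read_parquet(path))
  } else if (grepl("\\.feather$", path)) {
    setDT(arrow::read_feather(path))
  } else {
    fread(path, ...)
  }
}

dt <- fread2("test_dt.parquet")  # returns a data.table
```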