Add minimal implementation of ingesting Parquet and CSV files #327
Conversation
Thanks @voonhous I have two comments
I think it would make more sense if you extended the existing … At the … What do you think?
Also, would you mind updating the title of the PR to something more descriptive? We use the PRs as items in our change logs.
/retest
/test test-core-and-ingestion
I have created this code snippet which creates a testing dataframe. It has all the types besides lists of booleans. I think we should ensure that we can ingest this as both a Parquet file and a pandas dataframe. https://gist.github.com/woop/d074ded542bc2b6ec5a0b5a96c72e9ab
/retest
/retest |
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: voonhous, woop. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
This PR adds a minimal implementation for ingesting a Parquet file using PyArrow.
This is achieved by reading the Parquet file into a PyArrow Table, batching it into RecordBatches, and then ingesting the batches with the existing code:
my_file.parquet → PyArrow Table → RecordBatches → Dataframe → FeatureRows (existing code) → Stream
Other modifications: