Add minimal implementation of ingesting Parquet and CSV files #327
Conversation
Thanks @voonhous I have two comments
I think it would make more sense if you extended the existing … At the … What do you think?
Also, would you mind updating the title of the PR to something more descriptive? We use the PRs as items in our change logs.
/retest
/test test-core-and-ingestion
I have created this code snippet which creates a testing dataframe. It has all the types besides lists of booleans. I think we should ensure that we can ingest this as both a Parquet file and a pandas dataframe. https://gist.github.com/woop/d074ded542bc2b6ec5a0b5a96c72e9ab
/retest
/retest |
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: voonhous, woop. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
This PR adds a minimal implementation for ingesting a Parquet file using PyArrow.
This is achieved by reading the Parquet file into a PyArrow Table, batching it into RecordBatches, and then ingesting the batches with the existing code:
my_file.parquet → PyArrow Table → RecordBatches → Dataframe → FeatureRows (existing code) → Stream
Other modifications: