This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Ability to data profile node outputs for creating data quality checks #165

Closed
skrawcz opened this issue Aug 2, 2022 · 1 comment

@skrawcz
Collaborator

skrawcz commented Aug 2, 2022

Is your feature request related to a problem? Please describe.
Data profiling is a way to help bootstrap creating data quality checks.
Data profiling is also a way to facilitate data exploration, by providing summary statistics over data.

Describe the solution you'd like
A user should be able to profile their DAG, or a set of nodes, and get out some summary statistics.
Those statistics could then be used to bootstrap data quality checks, e.g. check_output() decorators, but the profiling output should also be usable on its own.
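
To make the idea concrete, here is a rough sketch of what "profile, then bootstrap a check" could look like. It is not an existing Hamilton API: `profile_output`, `suggest_check`, and the sample `outputs` dict are hypothetical names made up for illustration, and the suggestion keys (`range`, `allow_nans`) are only assumed to resemble the kind of arguments a check_output() decorator might take. It uses plain Python lists to stay self-contained; real node outputs would more likely be pandas Series.

```python
import math
import statistics
from typing import Any, Dict, List, Optional


def profile_output(values: List[Optional[float]]) -> Dict[str, Any]:
    """Summarize one node's output: count, missing values, min, max, mean."""
    # Treat both None and NaN as missing.
    present = [v for v in values if v is not None and not math.isnan(v)]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "mean": statistics.fmean(present) if present else None,
    }


def suggest_check(profile: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a profile into candidate settings for a data quality check.

    The key names are assumptions, not a confirmed decorator signature.
    """
    suggestion: Dict[str, Any] = {"allow_nans": profile["missing"] > 0}
    if profile["min"] is not None:
        suggestion["range"] = (profile["min"], profile["max"])
    return suggestion


# Hypothetical node outputs, keyed by node name.
outputs = {
    "spend": [10.0, 20.0, 40.0, 50.0],
    "signups": [1.0, None, 10.0, 50.0],
}

# The profiles are standalone summary statistics ...
profiles = {name: profile_output(vals) for name, vals in outputs.items()}
# ... and the suggestions are one possible downstream use of them.
suggestions = {name: suggest_check(p) for name, p in profiles.items()}
```

Keeping `profiles` separate from `suggestions` reflects the requirement above: the summary statistics are useful on their own for exploration, and deriving checks from them is an optional second step that a user would review before adopting.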

Describe alternatives you've considered
I haven't considered many options, but a few libraries already do data profiling.

Additional context
Systems like whylogs and Great Expectations use profiling to improve the user experience.
Standalone libraries like https://github.com/capitalone/DataProfiler also exist.

#149 does a little to prototype in this area too.

@elijahbenizzy
Collaborator

We are moving repositories! Please see the new version of this issue at DAGWorks-Inc/hamilton#40. Also, please give us a star/update any of your internal links.

Note that everything else (Slack community, PyPI packages, etc.) will not change at all.
