Representing & checking Dataset schemas #1900
I think the right word for this may be "schema". For applications and models (rather than data analysis), these sorts of conventions can be super-valuable. I like the idea of a declarative spec that can be validated. Just googling around, I came up with pandas-validator: https://github.com/c-data/pandas-validator
Right! 🤦♂️
Interesting, thanks. Do you think this fits into a 'function which validates', rather than a Mypy-like type annotation? I think ideally there would be a representation of the schema that could work with both, so maybe this isn't the important question atm.
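To make the "works with both" idea concrete, here is a minimal sketch in which one schema object serves as a validating function and as metadata on a type annotation via `typing.Annotated`. The `Schema` class, the `validated` decorator and the `WEATHER` example are all hypothetical names invented for illustration, not part of xarray or any library mentioned in this thread, and the data is a plain mapping of variable names to dimension names rather than a real Dataset.

```python
import functools
import inspect
from typing import Annotated, Dict, Tuple, get_type_hints

Dims = Dict[str, Tuple[str, ...]]


class Schema:
    """Hypothetical schema: variable name -> expected dimension names."""

    def __init__(self, **variables: Tuple[str, ...]):
        self.variables = variables

    def conforms(self, actual: Dims) -> bool:
        """Function-style check: each listed variable exists with the expected dims."""
        return all(actual.get(name) == dims for name, dims in self.variables.items())


def validated(func):
    """Check arguments annotated as Annotated[..., Schema(...)] at call time."""
    hints = get_type_hints(func, include_extras=True)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            # Annotated types carry their extra metadata in __metadata__.
            for meta in getattr(hints.get(name), "__metadata__", ()):
                if isinstance(meta, Schema) and not meta.conforms(value):
                    raise TypeError(f"argument {name!r} does not match its schema")
        return func(*args, **kwargs)

    return wrapper


# Hypothetical example schema.
WEATHER = Schema(temperature=("time", "lat", "lon"))


@validated
def temperature_dims(data: Annotated[Dims, WEATHER]) -> Tuple[str, ...]:
    return data["temperature"]
```

A static checker only sees `Dims` here, so the annotation documents the schema while the decorator enforces it at runtime. Whether a Mypy-like tool could check the schema itself is the open question in python/typing#513.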
And let me know if there are already textual schema definitions from other libraries that you think are good, before we go and build our own (we don't work with any netCDF-like files so don't have that context).
Somewhat related to this issue, I have implemented in xarray-simlab some logic to validate I'm currently in the process of refactoring this using attrs, which supports both validator functions and type annotations. Not sure how to use the latter for xarray objects, though (BTW I wasn't aware of python/typing#513, good to know!!). I agree that it would be nice to have a more generic way to describe xarray objects that can be reused in many contexts.
@benbovy That looks v interesting. Separately - I didn't know about the project but looks awesome. Do we have a list of projects that integrate xarray? Let's start one somewhere if not @pydata/xarray ?
@maxim-lian you're right. In this case
There is an ongoing discussion in #1850 about having something like xarray-contrib (likely a GitHub organization).
The commentary in python/typing#513, and @shoyer 's doc https://docs.google.com/document/d/1vpMse4c6DrWH5rq2tQSx3qwP_m_0lyn-Ij4WHqQqRHY/edit#heading=h.rkj7d39awayl are good & growing. I'll close this as I think riding on those coattails - with the addition of names and Datasets as containers - makes the most sense. (though reopen if we think there's something we could productively do separately)
I'm really interested in a machine-readable schema for xarray! Pandera provides machine-readable schemas for pandas and, as of version 0.7, pandera has decoupled pandera and pandas types to make pandera more useful for things like xarray. I haven't tried
Awesome -- would love to hear how this goes!
OK, I think But Pydantic looks promising. Here's a very quick coding experiment showing one way to use pydantic with xarray... it validates a few things; but it's not super-useful as a human-readable specification for what's going on inside a DataArray or Dataset.
Related to the Pandera integration, we are prototyping the xarray schema validation functionality in the xarray-schema project.
Does this project do (part of?) what's needed?
What would be the best way to canonically describe a dataset, which could be read by both humans and machines?
For example, frequently in our code we have docstrings which look something like:
This helps when attempting to understand what code is doing while only reading it.
But this isn't consistent between docstrings and can't be read or checked by a machine.
Has anyone solved this problem / have any suggestions for resources out there?
Tangentially related to python/typing#513 (but our issues are less about the types and dimension sizes, and more about the arrays within a dataset, their dimensions, and their names).
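As a rough illustration of what a description readable by both humans and machines might look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the `DatasetSchema` class, its `describe` and `check` methods, and the `temperature`/`elevation` layout are hypothetical and do not come from xarray or from the docstrings referred to above. The dataset is stood in for by a plain mapping of variable names to dimension names, so the sketch runs without xarray installed.

```python
from dataclasses import dataclass
from typing import List, Mapping, Tuple


@dataclass
class DatasetSchema:
    """Hypothetical declarative spec: variable name -> expected dimension names."""

    variables: Mapping[str, Tuple[str, ...]]

    def describe(self) -> str:
        """Human-readable form, e.g. for pasting into a docstring."""
        return "\n".join(
            f"{name}: ({', '.join(dims)})" for name, dims in self.variables.items()
        )

    def check(self, actual: Mapping[str, Tuple[str, ...]]) -> List[str]:
        """Return a list of mismatches; an empty list means the data conforms."""
        errors = []
        for name, dims in self.variables.items():
            if name not in actual:
                errors.append(f"missing variable {name!r}")
            elif tuple(actual[name]) != tuple(dims):
                errors.append(
                    f"{name!r}: expected dims {tuple(dims)}, got {tuple(actual[name])}"
                )
        return errors


# Hypothetical example layout.
schema = DatasetSchema(
    {"temperature": ("time", "lat", "lon"), "elevation": ("lat", "lon")}
)

# With xarray, this mapping could be built from a Dataset `ds` as
# {name: var.dims for name, var in ds.data_vars.items()}.
actual = {"temperature": ("time", "lat", "lon"), "elevation": ("lon", "lat")}

print(schema.describe())
print(schema.check(actual))
```

The same object produces the docstring text and performs the check, so the two cannot drift apart. Dtypes, coordinates and attributes are left out to keep the sketch short.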