Xarray integration #705

Open
1 of 3 tasks
cosmicBboy opened this issue Dec 11, 2021 · 5 comments
Labels
enhancement New feature or request

Comments

@cosmicBboy
Collaborator

cosmicBboy commented Dec 11, 2021

Is your feature request related to a problem? Please describe.

xarray is a project that provides a dict-like data container abstraction for n-dimensional arrays. It shares some commonalities with pandas, but there are many key differences (e.g. coords and attrs).

After chatting with @jhamman about this approach, we decided it would be appropriate to park xarray-schema within the pandera codebase. This issue tracks the planned integration of xarray-schema into the pandera codebase.

Describe the solution you'd like

A good start for this integration is to add a pandera.xarray module exposing the schema and schema component classes specific to xarray:

import numpy as np
import xarray as xr
from pandera.xarray import DataArraySchema, DatasetSchema

da = xr.DataArray(np.ones(4, dtype='i4'), dims=['x'], name='foo')

schema = DataArraySchema(dtype=np.integer, name='foo', shape=(4, ), dims=['x'])

schema.validate(da)
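
For readers less familiar with xarray, the checks implied by this proposed schema can be sketched as plain attribute comparisons. This is only an illustration under stated assumptions, not the pandera or xarray-schema implementation: `check_data_array` and `demo` are hypothetical names, and the helper relies only on the `dtype`, `name`, `shape` and `dims` attributes of a DataArray.

```python
import numpy as np


def check_data_array(da, *, dtype=None, name=None, shape=None, dims=None):
    """Return a list of mismatch messages; an empty list means every check passed.

    Hypothetical helper (not a pandera API): it only reads the ``dtype``,
    ``name``, ``shape`` and ``dims`` attributes of the object it is given.
    """
    errors = []
    # np.issubdtype accepts abstract types such as np.integer or np.floating.
    if dtype is not None and not np.issubdtype(da.dtype, dtype):
        errors.append(f"dtype: expected a subtype of {dtype!r}, got {da.dtype!r}")
    if name is not None and da.name != name:
        errors.append(f"name: expected {name!r}, got {da.name!r}")
    if shape is not None and tuple(da.shape) != tuple(shape):
        errors.append(f"shape: expected {tuple(shape)}, got {tuple(da.shape)}")
    if dims is not None and tuple(da.dims) != tuple(dims):
        errors.append(f"dims: expected {tuple(dims)}, got {tuple(da.dims)}")
    return errors


def demo():
    """Run the checks on the DataArray from the example above."""
    import xarray as xr  # imported here so the helper itself only needs numpy

    da = xr.DataArray(np.ones(4, dtype="i4"), dims=["x"], name="foo")
    return check_data_array(da, dtype=np.integer, name="foo", shape=(4,), dims=["x"])
```

Returning a list of messages rather than raising on the first mismatch is a design choice made for this sketch, so that all failures can be reported at once.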

TODO

Describe alternatives you've considered

The main alternative to this integration is to keep xarray-schema as a separate project that's interoperable with pandera. However, given that pandera plans on expanding its scope to validate data containers beyond pandas, it would benefit this project to maintain schema interfaces for multiple (not just pandas-like) data container libraries.

Additional context

@leroyvn

leroyvn commented Oct 13, 2023

Hi, I've been looking for an xarray validation library for a while now and I was wondering: is integration in Pandera still planned? Thanks!

@avcopan

avcopan commented Dec 17, 2024

In reply to @cosmicBboy's request on Discord, here is a simple xarray.DataArray schema that I would like to validate using Pandera:

class RateConstant(DataArrayModel):
    # coords:
    pressure  # sequence of floats: [0.1, 1.0, 10.0, ...]
    temperature  # sequence of floats: [500, 600, 700, ...]

What I want to be able to do is:

  1. Validate that the DataArray has coordinates named "pressure" and "temperature" (actually checking their types is low priority). Example:
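import numpy
import xarray

# create the DataArray:
ktp = xarray.DataArray(
    data=[[1e1, 1e2, 1e3, 1e4], [1e5, 1e6, 1e7, 1e8], [1e9, 1e10, 1e11, 1e12]],
    # name the dimensions explicitly so the coordinates line up with the 3 x 4 data
    dims=("pressure", "temperature"),
    coords={
        "pressure": [1, 10, numpy.inf],
        "temperature": [1000, 1500, 2000, 2500],
    },
)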

# validate the DataArray:
ktp = RateConstant.validate(ktp)
  2. Use RateConstant.pressure and RateConstant.temperature to access these coordinate names from the schema (to avoid typos and allow me to easily update the code that depends on the schema, if needed). Example:
>>> print(RateConstant.temperature)
"temperature"

Note that this is for a simple DataArray, not a Dataset.

@cosmicBboy
Collaborator Author

@avcopan thank you! Just to follow up, I'm not super familiar with xarray, so it would help a lot if you could write out the xarray-native assertions that you would write for the following validations:

  1. Validate that the DataArray has coordinates named "pressure" and "temperature" (actually checking their types is low priority).
  2. Use RateConstant.pressure and RateConstant.temperature to access these coordinate names from the schema (to avoid typos and allow me to easily update the code that depends on the schema, if needed).

I can then start to map those over to the pandera API.

@avcopan

avcopan commented Dec 27, 2024

Oh, sure!

With the above example, you would just do:

assert "pressure" in ktp.coords
assert "temperature" in ktp.coords
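
The two assertions above, together with the wish to read coordinate names off the schema, can be combined in a short sketch. Everything here is hypothetical and is not an existing pandera API: `RateConstantCoords`, `missing_coords`, `validate_coords` and `demo` are names invented for illustration, and the only behaviour assumed of the DataArray is that `.coords` supports membership tests like a dictionary.

```python
class RateConstantCoords:
    """Hypothetical stand-in for a schema class: coordinate names as attributes."""

    pressure = "pressure"
    temperature = "temperature"

    @classmethod
    def names(cls):
        """Return the required coordinate names in declaration order."""
        return (cls.pressure, cls.temperature)


def missing_coords(da, required):
    """Return the required coordinate names that are absent from ``da.coords``."""
    return [name for name in required if name not in da.coords]


def validate_coords(da, required):
    """Raise ValueError if any required coordinate is missing, else return ``da``."""
    missing = missing_coords(da, required)
    if missing:
        raise ValueError(f"missing coordinates: {missing}")
    return da


def demo():
    """Validate a DataArray shaped like the rate-constant example in this thread."""
    import numpy
    import xarray  # imported here so the helpers above need no third-party packages

    ktp = xarray.DataArray(
        data=numpy.ones((3, 4)),
        dims=(RateConstantCoords.pressure, RateConstantCoords.temperature),
        coords={
            RateConstantCoords.pressure: [1, 10, numpy.inf],
            RateConstantCoords.temperature: [1000, 1500, 2000, 2500],
        },
    )
    return validate_coords(ktp, RateConstantCoords.names())
```

Because the names live in one place, code that indexes the array can use `RateConstantCoords.temperature` instead of repeating the string literal.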

@avcopan

avcopan commented Dec 27, 2024

<DataArray object>.coords acts essentially like a dictionary mapping coordinate keys (e.g. "temperature" and "pressure") to their values.
