Test fsspec roundtrip #42
Codecov Report
Attention: Patch coverage is
@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
- Coverage   90.18%   88.92%   -1.26%
==========================================
  Files          14       14
  Lines         998      867     -131
==========================================
- Hits          900      771     -129
+ Misses         98       96       -2
I strongly recommend implementing data read just for catching errors in the time coordinate. It is incredibly common for time vectors to be encoded differently in different files in the "same dataset".
What do you mean specifically? Like an optional argument to
ds = xr.tutorial.open_dataset("air_temperature", decode_times=False).isel(
    time=slice(None, 2000)
)
Suggested change:
- ds = xr.tutorial.open_dataset("air_temperature", decode_times=False).isel(
-     time=slice(None, 2000)
- )
+ # set up example xarray dataset
+ ds = xr.tutorial.open_dataset("air_temperature", decode_times=True).isel(time=slice(None, 2000))
+ del ds.time.encoding["units"]
This is the hard case, the two time variables will be encoded differently, and the concat makes no sense any more.
IIRC kerchunk supports this by decoding time and inlining bytes in the JSON. I think you'll want to at least raise an error here.
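To make the "hard case" concrete, here is a minimal standard-library sketch of why concatenating undecoded time values goes wrong when each file carries its own units. The two "files" and the `decode_cf_time` helper are hypothetical illustrations, not xarray, kerchunk, or VirtualiZarr API, and the decoder handles only simple "<unit> since <ISO date>" strings with no calendar support.

```python
from datetime import datetime, timedelta


def decode_cf_time(values, units):
    """Decode numeric offsets such as 'hours since 2013-01-01' into datetimes.

    Deliberately simplified: supports only units that are valid
    ``timedelta`` keywords (days, hours, minutes, seconds) and an ISO
    reference date, and ignores calendars entirely.
    """
    unit, _, reference = units.partition(" since ")
    origin = datetime.fromisoformat(reference.strip())
    return [origin + timedelta(**{unit.strip(): float(v)}) for v in values]


# Hypothetical files: same raw numbers, but units chosen separately on write.
file_a = {"units": "hours since 2013-01-01", "values": [0, 6, 12, 18]}
file_b = {"units": "hours since 2013-01-02", "values": [0, 6, 12, 18]}

# Concatenating the undecoded values silently repeats the same offsets.
raw_concat = file_a["values"] + file_b["values"]

# Decoding each file with its own units first gives a sensible time axis.
decoded_concat = decode_cf_time(file_a["values"], file_a["units"]) + decode_cf_time(
    file_b["values"], file_b["units"]
)
```

Here `raw_concat` contains the offsets 0, 6, 12, 18 twice, so interpreting it with either file's units would place two days of data on the same day, while `decoded_concat` is a strictly increasing sequence of eight timestamps.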
Why will they be encoded differently?
In what case should I be raising an error? When the encoding attributes are different?
(Sorry I'm not very familiar with the encoding step in general)
> Why will they be encoded differently?
The encoding chooses units separately for each dataset since they are written separately.
This happens in the wild, and you'll want to catch it.
> When the encoding attributes are different?
I think so. How do you handle someone trying to concat different files with different encodings? Are the encodings simply discarded and you use those from the first file?
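One way the suggested error could look, as a rough sketch only: compare the time encoding attributes of every file against the first and refuse to concatenate on a mismatch. The function name `check_time_encodings_match` and the choice of keys ("units" and "calendar") are assumptions for illustration; the dicts stand in for whatever per-file encoding or attribute mappings are actually available.

```python
def check_time_encodings_match(encodings, keys=("units", "calendar")):
    """Raise ValueError if any encoding disagrees with the first on `keys`.

    `encodings` is a sequence of plain dicts, one per file, e.g. the time
    variable's encoding or attributes (hypothetical input for this sketch).
    """
    if not encodings:
        return
    reference = encodings[0]
    for index, encoding in enumerate(encodings[1:], start=1):
        for key in keys:
            if encoding.get(key) != reference.get(key):
                raise ValueError(
                    f"time encoding mismatch in file {index}: "
                    f"{key}={encoding.get(key)!r} but the first file has "
                    f"{key}={reference.get(key)!r}"
                )


# Usage: identical encodings pass silently, differing units raise.
check_time_encodings_match(
    [{"units": "hours since 2013-01-01"}, {"units": "hours since 2013-01-01"}]
)
```

Raising rather than silently taking the first file's encoding keeps the failure visible at concat time, which is the behaviour asked for above.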
Okay thanks.
> How do you handle someone trying to concat different files with different encodings? Are the encodings simply discarded and you use those from the first file?
At the moment I'm not doing anything with encodings explicitly, and my tests have been testing with ManifestArrays created in-memory. So I'm not actually deliberately using them anywhere...
PR #69 added the ability to load selected variables in as normal lazy numpy arrays, so they will get decoded by xarray automatically.
EDIT: But we can't use it to make this roundtripping test work until I implement writing those numpy arrays out "inlined" into the kerchunk reference files, see #62.
I also opened issue #68 to discuss the encoding problem more generally.
…s/VirtualiZarr into test_fsspec_roundtrip
This works (locally) except for two things:
- It requires this PR to xarray, so that it does not automatically try to create pandas indexes for the dimension coordinates: "Opt out of auto creating index variables" (pydata/xarray#8711).
- The time coordinate is not being decoded, so the roundtripped dataset only matches the original because I used decode_times=False when opening the original tutorial dataset. If I get rid of that this test fails with