Error when writing string coordinate variables to zarr #3476
Comments
Hi @jsadler2 - Can you show us what …

Sure …
Thanks @jsadler2 - I think this is a bug in xarray. We should be able to round trip the …
I'm experiencing the same issue, which also seems to be related to one of my coordinates having `object` as its datatype. Luckily, the workaround proposed by @jsadler2 works in my case, too.
I ran into the same issue. It seems like zarr is inserting …
Indeed. And it appears that these lines are the culprits (`xarray/xarray/backends/zarr.py`, lines 468 to 471 at 83706af):

```python
ipdb> v
<xarray.Variable (x: 3)>
array(['a', 'b', 'c'], dtype=object)
ipdb> check
False
ipdb> vn
'x'
ipdb> encoding = extract_zarr_variable_encoding(v, raise_on_invalid=check, name=vn)
ipdb> encoding
{'chunks': (3,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': [VLenUTF8()]}
```

Then (`xarray/xarray/backends/zarr.py`, lines 480 to 482 at 83706af), zarr appears to ignore the filter information from xarray and proceeds to extract its own filter. As a result, we end up with two filters:

```python
ipdb> zarr_array = self.ds.create(name, shape=shape, dtype=dtype, fill_value=fill_value, **encoding)
ipdb> self.ds['x']._meta
{'zarr_format': 2, 'shape': (3,), 'chunks': (3,), 'dtype': dtype('O'), 'compressor': {'blocksize': 0, 'clevel': 5, 'cname': 'lz4', 'id': 'blosc', 'shuffle': 1}, 'fill_value': None, 'order': 'C', 'filters': [{'id': 'vlen-utf8'}, {'id': 'vlen-utf8'}]}
ipdb> self.ds['x']._meta['filters']
[{'id': 'vlen-utf8'}, {'id': 'vlen-utf8'}]
```

As @borispf and @jsadler2 suggested, clearing the filters from the encoding before initiating the zarr store creation works:

```python
ipdb> enc  # without filters
{'chunks': (3,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': []}
ipdb> zarr_array = self.ds.create(name, shape=shape, dtype=dtype, fill_value=fill_value, **enc)
ipdb> self.ds['x']._meta['filters']
[{'id': 'vlen-utf8'}]
```

@jhamman, since you are more familiar with the internals of zarr + xarray, should we default to ignoring filter information from xarray and let zarr take care of extracting it?
Thanks @andersy005 for the write-up and for digging into this. Doesn't this seem like it could be a bug in Zarr's …
Hi, I keep running into this issue all the time as well. The only workaround that seems to help is the following:

```python
to_store = xrds.copy()
for var in to_store.variables:
    to_store[var].encoding.clear()
```
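A self-contained version of the clearing workaround above; the dataset, variable names, and the stale `filters` entry are all illustrative stand-ins for whatever came back from an earlier zarr read:

```python
import numpy as np
import xarray as xr

# Toy dataset with an object-dtype string coordinate, mimicking the issue.
xrds = xr.Dataset(
    {"precip": ("site", np.arange(3.0))},
    coords={"site_code": ("site", np.array(["a", "b", "c"], dtype=object))},
)
# Pretend this came back from an earlier zarr read, carrying stale encoding
# (hypothetical value, just to mark the dict as non-empty).
xrds["site_code"].encoding["filters"] = ["stale-vlen-utf8"]

# The workaround: copy, then drop every variable's on-disk encoding so a
# subsequent to_zarr() re-derives filters/compressor/chunks from scratch.
to_store = xrds.copy()
for var in to_store.variables:
    to_store[var].encoding.clear()

print(to_store["site_code"].encoding)  # {}
# to_store.to_zarr("out.zarr")  # now writes without the duplicate-filter error
```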
Thanks for that; it looks like I just came across this.
I think it worked with one variable; @Hoeze's workaround was necessary for more than one.
This has been happening a lot to me lately when writing to zarr. Thanks to @bolliger32 for the tip - this usually works like a charm:

```python
for v in list(ds.coords.keys()):
    if ds.coords[v].dtype == object:
        ds.coords[v] = ds.coords[v].astype("unicode")

for v in list(ds.variables.keys()):
    if ds[v].dtype == object:
        ds[v] = ds[v].astype("unicode")
```

For whatever reason, clearing the encoding and/or using … Note the flag raised by @FlorisCalkoen below - don't just throw this at all your writes! There are other object types (e.g. CFTime) which you probably don't want to convert to string. This is just a patch to get around this issue for DataArrays with string coords/variables.
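At the numpy level, the conversion in the snippet above, and the hazard flagged with it, look like this. This is a sketch with made-up values; `FakeCFTime` is a hypothetical stand-in for a cftime-style object, and `astype(str)` is equivalent to the `"unicode"` alias used above:

```python
import numpy as np

# Object-dtype array of Python strings: what a string coordinate often
# looks like after a round trip through zarr.
codes = np.array(["site_a", "site_b", "site_c"], dtype=object)

converted = codes.astype(str)
print(converted.dtype)  # fixed-width unicode, e.g. <U6

# Hazard: astype to unicode stringifies *any* object, not just str.
class FakeCFTime:
    """Hypothetical stand-in for a cftime datetime object."""
    def __str__(self):
        return "0001-01-01 00:00:00"

times = np.array([FakeCFTime()], dtype=object)
# The date object silently becomes its string form and is no longer a date.
print(times.astype(str))
```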
@delgadom I just noticed that your proposed solution has the side effect of also converting … I updated your lines using @Hoeze's clear function, and that seems to work for now.
ha - yeah, that's a good flag. I definitely didn't intend for that to be a universally applied patch, so I probably should have included a buyer-beware. But we did find that clearing the encoding doesn't always do the trick for string arrays, so a comprehensive patch will probably need to be more nuanced.
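One way the "more nuanced" patch could look: only cast object-dtype variables whose elements are actually all Python `str`, leaving other object payloads (cftime dates, dicts, etc.) untouched. `coerce_string_vars` is a hypothetical helper name, and the dataset is a toy example:

```python
import numpy as np
import xarray as xr

def coerce_string_vars(ds: xr.Dataset) -> xr.Dataset:
    """Hypothetical helper: cast object-dtype variables to unicode only
    when every element is actually a Python str."""
    ds = ds.copy()
    for name in list(ds.variables):
        var = ds.variables[name]
        if var.dtype != object:
            continue
        values = var.values.ravel()
        if len(values) and all(isinstance(v, str) for v in values):
            ds[name] = ds[name].astype(str)
    return ds

# Mixed dataset: a string coord plus an object variable that is NOT strings.
ds = xr.Dataset(
    {"flag": ("x", np.array([{"raw": 1}, {"raw": 2}], dtype=object))},
    coords={"x": np.array(["a", "b"], dtype=object)},
)
out = coerce_string_vars(ds)
print(out["x"].dtype.kind)  # 'U': converted to fixed-width unicode
print(out["flag"].dtype)    # object: left alone
```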
My impression is that keeping the zarr encoding leads to a bunch of issues (see my issue above, or the current one). There also seems to be an issue with chunking being preserved because the array encodings aren't overwritten, but I can't find that issue right now. Since all of these issues are resolved by popping the zarr encoding, I am wondering what the downsides of that are, and whether it'd be easier to not keep that encoding at all.
I find this annoying too! I think we can fold it into #6323, though it's also common enough that maybe we leave it open for future travellers.
I hit this issue this week, and the workaround in #3476 (comment) worked for me. |
I run into this issue fairly commonly (usually while copying a dataset …
I saved an xarray dataset to zarr using `to_zarr`. I then later tried to read that dataset from the original zarr, re-chunk it, and then write it to a new zarr. When I did that, I got a strange error. I attached a zip of a minimal version of the zarr dataset that I am using: test_sm_zarr.zip
MCVE Code Sample
Expected Output
No error
Problem Description
I get this error:
BUT I think it has something to do with the datatype of one of my coordinates, `site_code`, because if I do this I get no error: … Before converting, the datatype of the `site_code` coordinate is `object`.
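The conversion described above can be sketched as follows. This is a toy stand-in for the attached dataset; the variable name `sm` and the station-code values are illustrative, not taken from the actual file:

```python
import numpy as np
import xarray as xr

# Toy stand-in: one coordinate carrying object-dtype strings, as
# site_code does after reading the original zarr back in.
ds = xr.Dataset(
    {"sm": ("site", np.zeros(3))},
    coords={"site_code": ("site", np.array(["s1", "s2", "s3"], dtype=object))},
)
assert ds["site_code"].dtype == object

# Convert the coordinate to a fixed-width unicode dtype before writing.
ds = ds.assign_coords(site_code=ds["site_code"].astype(str))
print(ds["site_code"].dtype)  # fixed-width unicode, e.g. <U2
# ds.chunk({"site": 1}).to_zarr("new.zarr")  # the re-chunk + write step
```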
Output of `xr.show_versions()`:
xarray: 0.12.2
pandas: 0.25.1
numpy: 1.17.1
scipy: 1.3.1
netCDF4: 1.5.1.2
pydap: installed
h5netcdf: None
h5py: None
Nio: None
zarr: 2.3.2
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.3.0
distributed: 2.5.1
matplotlib: 3.1.1
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.2.0
pip: 19.2.3
conda: None
pytest: 5.1.2
IPython: 7.8.0
sphinx: None