Error when writing string coordinate variables to zarr #3476
Comments
Hi @jsadler2 - Can you show us what …

Sure …
Thanks @jsadler2 - I think this is a bug in xarray. We should be able to round trip the …
I'm experiencing the same issue, which also seems to be related to one of my coordinates having `object` as its datatype. Luckily, the workaround proposed by @jsadler2 works in my case, too.
I ran into the same issue. It seems like zarr is inserting …
Indeed. And it appears that these lines are the culprits (`xarray/xarray/backends/zarr.py`, lines 468 to 471 at 83706af):

```python
ipdb> v
<xarray.Variable (x: 3)>
array(['a', 'b', 'c'], dtype=object)
ipdb> check
False
ipdb> vn
'x'
ipdb> encoding = extract_zarr_variable_encoding(v, raise_on_invalid=check, name=vn)
ipdb> encoding
{'chunks': (3,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': [VLenUTF8()]}
```

Then (`xarray/xarray/backends/zarr.py`, lines 480 to 482 at 83706af), zarr appears to ignore the filter information from xarray and proceeds to extract its own filter. As a result, we end up with two filters:

```python
ipdb> zarr_array = self.ds.create(name, shape=shape, dtype=dtype, fill_value=fill_value, **encoding)
ipdb> self.ds['x']._meta
{'zarr_format': 2, 'shape': (3,), 'chunks': (3,), 'dtype': dtype('O'), 'compressor': {'blocksize': 0, 'clevel': 5, 'cname': 'lz4', 'id': 'blosc', 'shuffle': 1}, 'fill_value': None, 'order': 'C', 'filters': [{'id': 'vlen-utf8'}, {'id': 'vlen-utf8'}]}
ipdb> self.ds['x']._meta['filters']
[{'id': 'vlen-utf8'}, {'id': 'vlen-utf8'}]
```

As @borispf and @jsadler2 suggested, clearing the filters from the encoding before initiating the zarr store creation works:

```python
ipdb> enc  # without filters
{'chunks': (3,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': []}
ipdb> zarr_array = self.ds.create(name, shape=shape, dtype=dtype, fill_value=fill_value, **enc)
ipdb> self.ds['x']._meta['filters']
[{'id': 'vlen-utf8'}]
```

@jhamman, since you are more familiar with the internals of zarr + xarray, should we default to ignoring filter information from xarray and let zarr take care of extracting it?
Thanks @andersy005 for the write-up and for digging into this. Doesn't this seem like it could be a bug in Zarr's …
Hi, I keep running into this issue all the time as well. The only workaround that seems to help is the following:

```python
to_store = xrds.copy()
for var in to_store.variables:
    to_store[var].encoding.clear()
```
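A self-contained version of the clearing workaround above; the dataset, variable names, and the stale `filters` entry are all illustrative stand-ins for whatever came back from an earlier zarr read:

```python
import numpy as np
import xarray as xr

# Toy dataset with an object-dtype string coordinate, mimicking the issue.
xrds = xr.Dataset(
    {"precip": ("site", np.arange(3.0))},
    coords={"site_code": ("site", np.array(["a", "b", "c"], dtype=object))},
)
# Pretend this came back from an earlier zarr read, carrying stale encoding
# (hypothetical value, just to mark the dict as non-empty).
xrds["site_code"].encoding["filters"] = ["stale-vlen-utf8"]

# The workaround: copy, then drop every variable's on-disk encoding so a
# subsequent to_zarr() re-derives filters/compressor/chunks from scratch.
to_store = xrds.copy()
for var in to_store.variables:
    to_store[var].encoding.clear()

print(to_store["site_code"].encoding)  # {}
# to_store.to_zarr("out.zarr")  # now writes without the duplicate-filter error
```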
Thanks for that; it looks like I just came across this.
I think it worked with one variable; @Hoeze's workaround was necessary for more than one.
This has been happening a lot to me lately when writing to zarr. Thanks to @bolliger32 for the tip - this usually works like a charm:

```python
for v in list(ds.coords.keys()):
    if ds.coords[v].dtype == object:
        ds.coords[v] = ds.coords[v].astype("unicode")

for v in list(ds.variables.keys()):
    if ds[v].dtype == object:
        ds[v] = ds[v].astype("unicode")
```

For whatever reason, clearing the encoding and/or using … Note the flag raised by @FlorisCalkoen below - don't just throw this at all your writes! There are other object types (e.g. CFTime) which you probably don't want to convert to string. This is just a patch to get around this issue for DataArrays with string coords/variables.
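At the numpy level, the conversion in the snippet above, and the hazard flagged with it, look like this. This is a sketch with made-up values; `FakeCFTime` is a hypothetical stand-in for a cftime-style object, and `astype(str)` is equivalent to the `"unicode"` alias used above:

```python
import numpy as np

# Object-dtype array of Python strings: what a string coordinate often
# looks like after a round trip through zarr.
codes = np.array(["site_a", "site_b", "site_c"], dtype=object)

converted = codes.astype(str)
print(converted.dtype)  # fixed-width unicode, e.g. <U6

# Hazard: astype to unicode stringifies *any* object, not just str.
class FakeCFTime:
    """Hypothetical stand-in for a cftime datetime object."""
    def __str__(self):
        return "0001-01-01 00:00:00"

times = np.array([FakeCFTime()], dtype=object)
# The date object silently becomes its string form and is no longer a date.
print(times.astype(str))
```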
@delgadom I just noticed that your proposed solution has the side effect of also converting … I updated your lines using @Hoeze's clear function, and that seems to work for now.
ha - yeah, that's a good flag. I definitely didn't intend for that to be a universally applied patch, so I probably should have included a buyer-beware. But we did find that clearing the encoding doesn't always do the trick for string arrays, so a comprehensive patch will probably need to be more nuanced.
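One way the "more nuanced" patch could look: only cast object-dtype variables whose elements are actually all Python `str`, leaving other object payloads (cftime dates, dicts, etc.) untouched. `coerce_string_vars` is a hypothetical helper name, and the dataset is a toy example:

```python
import numpy as np
import xarray as xr

def coerce_string_vars(ds: xr.Dataset) -> xr.Dataset:
    """Hypothetical helper: cast object-dtype variables to unicode only
    when every element is actually a Python str."""
    ds = ds.copy()
    for name in list(ds.variables):
        var = ds.variables[name]
        if var.dtype != object:
            continue
        values = var.values.ravel()
        if len(values) and all(isinstance(v, str) for v in values):
            ds[name] = ds[name].astype(str)
    return ds

# Mixed dataset: a string coord plus an object variable that is NOT strings.
ds = xr.Dataset(
    {"flag": ("x", np.array([{"raw": 1}, {"raw": 2}], dtype=object))},
    coords={"x": np.array(["a", "b"], dtype=object)},
)
out = coerce_string_vars(ds)
print(out["x"].dtype.kind)  # 'U': converted to fixed-width unicode
print(out["flag"].dtype)    # object: left alone
```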
My impression is that keeping the zarr encoding leads to a bunch of issues (see my issue above, or the current one). There also seems to be an issue with chunking being preserved because the array encodings aren't overwritten, but I can't find that issue right now. Since all of these issues are resolved by popping the zarr encoding, I am wondering what the downsides of that are, and whether it'd be easier to not keep that encoding at all.
I find this annoying too! I think we can fold it into #6323, though it's also common enough that maybe we leave it open for future travellers.
I hit this issue this week, and the workaround in #3476 (comment) worked for me. |
I run into this issue fairly commonly (usually while copying a dataset …
I saved an xarray dataset to zarr using `to_zarr`. I then later tried to read that dataset from the original zarr, re-chunk it, and then write it to a new zarr. When I did that, I got a strange error. I attached a zip of a minimal version of the zarr dataset that I am using: test_sm_zarr.zip
MCVE Code Sample
Expected Output
No error
Problem Description
I get this error:
BUT I think it has something to do with the datatype of one of my coordinates, `site_code`, because if I do this I get no error: … Before converting, the datatype of the `site_code` coordinate is `object`.
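The conversion described above can be sketched as follows. This is a toy stand-in for the attached dataset; the variable name `sm` and the station-code values are illustrative, not taken from the actual file:

```python
import numpy as np
import xarray as xr

# Toy stand-in: one coordinate carrying object-dtype strings, as
# site_code does after reading the original zarr back in.
ds = xr.Dataset(
    {"sm": ("site", np.zeros(3))},
    coords={"site_code": ("site", np.array(["s1", "s2", "s3"], dtype=object))},
)
assert ds["site_code"].dtype == object

# Convert the coordinate to a fixed-width unicode dtype before writing.
ds = ds.assign_coords(site_code=ds["site_code"].astype(str))
print(ds["site_code"].dtype)  # fixed-width unicode, e.g. <U2
# ds.chunk({"site": 1}).to_zarr("new.zarr")  # the re-chunk + write step
```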
Output of `xr.show_versions()`:
xarray: 0.12.2
pandas: 0.25.1
numpy: 1.17.1
scipy: 1.3.1
netCDF4: 1.5.1.2
pydap: installed
h5netcdf: None
h5py: None
Nio: None
zarr: 2.3.2
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.3.0
distributed: 2.5.1
matplotlib: 3.1.1
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.2.0
pip: 19.2.3
conda: None
pytest: 5.1.2
IPython: 7.8.0
sphinx: None