Non-kerchunk backend for HDF5/netcdf4 files. (#87)
* Generate chunk manifest backed variable from HDF5 dataset.

* Transfer dataset attrs to variable.

* Get virtual variables dict from HDF5 file.

* Update virtual_vars_from_hdf to use fsspec and drop_variables arg.

* mypy fix to use ChunkKey and empty dimensions list.

* Extract attributes from hdf5 root group.

* Use hdf reader for netcdf4 files.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix ruff complaints.

* First steps for handling HDF5 filters.

* Initial step for hdf5plugin supported codecs.

* Small commit to check compression support in CI environment.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix mypy complaints for hdf_filters.

* Local pre-commit fix for hdf_filters.

* Use fsspec reader_options introduced in #37.

* Fix incorrect zarr_v3 if block position from merge commit ef0d7a8.

* Fix early return from hdf _extract_attrs.

* Test that _extract_attrs correctly handles multiple attributes.

* Initial attempt at scale and offset via numcodecs.

* Tests for cfcodec_from_dataset.

* Temporarily relax integration tests to assert_allclose.

* Add blosc_lz4 fixture parameterization to confirm libnetcdf environment.

* Check for compatibility with netcdf4 engine.

* Use separate fixtures for h5netcdf and netcdf4 compression styles.

* Print libhdf5 and libnetcdf4 versions to confirm compiled environment.

* Skip netcdf4 style compression tests when libhdf5 < 1.14.

* Include imagecodecs.numcodecs to support HDF5 lzf filters.

* Remove test that verifies call to read_kerchunk_references_from_file.

* Add additional codec support structures for imagecodecs and numcodecs.

* Add codec config test for Zstd.

* Include initial cf decoding tests.

* Revert typo for scale_factor retrieval.

* Update reader to use new numpy manifest representation.

* Temporarily skip test until blosc netcdf4 issue is solved.

* Fix Pydantic 2 migration warnings.

* Include hdf5plugin and imagecodecs-numcodecs in mamba test environment.

* Mamba attempt with imagecodecs rather than imagecodecs-numcodecs.

* Mamba attempt with latest imagecodecs release.

* Use correct iter_chunks callback function signature.

* Include pip based imagecodecs-numcodecs until conda-forge availability.

* Handle non-coordinate dims which are serialized to hdf as empty dataset.

* Use reader_options for filetype check and update failing kerchunk call.

* Fix chunkmanifest shaping for chunked datasets.

* Handle scale_factor attribute serialization for compressed files.

* Include chunked roundtrip fixture.

* Standardize xarray integration tests for hdf filters.

* Update reader selection logic for new filetype determination.

* Use decode_times for integration test.

* Standardize fixture names for hdf5 vs netcdf4 file types.

* Handle array add_offset property for compressed data.

* Include h5py shuffle filter.

* Make ScaleAndOffset codec last in filters list.

* Apply ScaleAndOffset codec to _FillValue since its value is now downstream.

* Coerce scale and add_offset values to native float for JSON serialization.

* Temporarily xfail integration tests for main

* Remove pydantic dependency as per pull/210.

* Update test for new kerchunk reader module location.

* Fix branch typing errors.

* Re-include automatic file type determination.

* Handle various hdf flavors of _FillValue storage.

* Include loadable variables in drop variables list.

* Mock readers.hdf.virtual_vars_from_hdf to verify option passing.

* Convert numpy _FillValue to native Python for serialization support.

* Support groups with HDF5 reader.

* Handle empty variables with a shape.

* Import top-level version of xarray classes.

* Add option to explicitly specify use of an experimental hdf backend.

* Include imagecodecs and hdf5plugin in all CI environments.

* Add test_hdf_integration tests to be skipped for non-kerchunk env.

* Include imagecodecs in dependencies.

* Diagnose imagecodecs-numcodecs installation failures in CI.

* Ignore mypy complaints for VirtualBackend.

* Remove checksum assert which varies across different zstd versions.

* Temporarily xfail integration tests with coordinate inconsistency.

* Remove backend arg for non-hdf network file tests.

* Fix mypy comment moved by ruff formatting.

* Make HDF reader dependencies optional.

* Handle optional imagecodecs and hdf5plugin dependency imports for tests.

* Prevent conflicts with explicit filetype and backend args.

* Correctly convert root coordinate attributes to a list.

* Clarify that method extracts attrs from any specified group.

* Restructure hdf reader and codec filters into a module namespace.

* Improve docstrings for hdf and filter modules.

* Explicitly specify HDF5VirtualBackend for test parameter.

* Include issue references for xfailed tests.

* Use soft import strategy for optional dependencies (see xarray/issues/9554).

* Handle mypy for soft imports.

* Attempt at nested optional dependency usage.

* Handle use of soft import sub modules for typing.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
sharkinsspatial and pre-commit-ci[bot] authored Nov 19, 2024
1 parent 5c3e204 commit 647d175
Showing 18 changed files with 1,439 additions and 64 deletions.
6 changes: 6 additions & 0 deletions ci/environment.yml
@@ -13,6 +13,8 @@ dependencies:
- ujson
- packaging
- universal_pathlib
- hdf5plugin
- numcodecs
# Testing
- codecov
- pre-commit
@@ -27,7 +29,11 @@ dependencies:
- fsspec
- s3fs
- fastparquet
- imagecodecs>=2024.6.1
# for opening tiff files
- tifffile
# for opening FITS files
- astropy
- pip
- pip:
- imagecodecs-numcodecs==2024.6.1
1 change: 0 additions & 1 deletion ci/min-deps.yml
@@ -3,7 +3,6 @@ channels:
- conda-forge
- nodefaults
dependencies:
- h5netcdf
- h5py
- hdf5
- netcdf4
4 changes: 4 additions & 0 deletions ci/upstream.yml
@@ -12,6 +12,9 @@ dependencies:
- packaging
- ujson
- universal_pathlib
- hdf5plugin
- numcodecs
- imagecodecs>=2024.6.1
# Testing
- codecov
- pre-commit
@@ -27,3 +30,4 @@ dependencies:
- pip:
- icechunk # Installs zarr v3 as dependency
# - git+https://github.com/fsspec/kerchunk@main # kerchunk is currently incompatible with zarr-python v3 (https://github.com/fsspec/kerchunk/pull/516)
- imagecodecs-numcodecs==2024.6.1
11 changes: 10 additions & 1 deletion pyproject.toml
@@ -30,15 +30,23 @@ dependencies = [
]

[project.optional-dependencies]
hdf_reader = [
"fsspec",
"h5py",
"hdf5plugin",
"imagecodecs",
"imagecodecs-numcodecs==2024.6.1",
"numcodecs"
]
test = [
"codecov",
"fastparquet",
"fsspec",
"h5netcdf",
"h5py",
"kerchunk>=0.2.5",
"mypy",
"netcdf4",
"numcodecs",
"pandas-stubs",
"pooch",
"pre-commit",
@@ -48,6 +56,7 @@ test = [
"ruff",
"s3fs",
"scipy",
"virtualizarr[hdf_reader]"
]


16 changes: 12 additions & 4 deletions virtualizarr/backend.py
@@ -16,8 +16,10 @@
HDF5VirtualBackend,
KerchunkVirtualBackend,
NetCDF3VirtualBackend,
TIFFVirtualBackend,
ZarrV3VirtualBackend,
)
from virtualizarr.readers.common import VirtualBackend
from virtualizarr.utils import _FsspecFSFromFilepath, check_for_collisions

# TODO add entrypoint to allow external libraries to add to this mapping
@@ -26,10 +28,10 @@
"zarr_v3": ZarrV3VirtualBackend,
"dmrpp": DMRPPVirtualBackend,
# all the below call one of the kerchunk backends internally (https://fsspec.github.io/kerchunk/reference.html#file-format-backends)
"netcdf3": NetCDF3VirtualBackend,
"hdf5": HDF5VirtualBackend,
"netcdf4": HDF5VirtualBackend, # note this is the same as for hdf5
# "tiff": TIFFVirtualBackend,
"netcdf3": NetCDF3VirtualBackend,
"tiff": TIFFVirtualBackend,
"fits": FITSVirtualBackend,
}

@@ -112,6 +114,7 @@ def open_virtual_dataset(
indexes: Mapping[str, Index] | None = None,
virtual_array_class=ManifestArray,
reader_options: Optional[dict] = None,
backend: Optional[VirtualBackend] = None,
) -> Dataset:
"""
Open a file or store as an xarray Dataset wrapping virtualized zarr arrays.
@@ -173,15 +176,20 @@ def open_virtual_dataset(
if reader_options is None:
reader_options = {}

if backend and filetype:
raise ValueError("Cannot pass both a filetype and an explicit VirtualBackend")

if filetype is not None:
# if filetype is user defined, convert to FileType
filetype = FileType(filetype)
else:
filetype = automatically_determine_filetype(
filepath=filepath, reader_options=reader_options
)

backend_cls = VIRTUAL_BACKENDS.get(filetype.name.lower())
if backend:
backend_cls = backend
else:
backend_cls = VIRTUAL_BACKENDS.get(filetype.name.lower()) # type: ignore

if backend_cls is None:
raise NotImplementedError(f"Unsupported file type: {filetype.name}")
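The selection rule in this hunk can be sketched in isolation. The following is a minimal, self-contained approximation, not the project's implementation: `select_backend`, `DEFAULT_BACKENDS`, `KerchunkBackedReader`, and `ExperimentalHDFReader` are hypothetical names, the `FileType` enum is a cut-down stand-in, and automatic file type detection is replaced by a fixed assumption.

```python
from enum import Enum
from typing import Optional


class FileType(Enum):
    """Hypothetical subset of file types, for illustration only."""

    hdf5 = "hdf5"
    netcdf4 = "netcdf4"
    netcdf3 = "netcdf3"


class KerchunkBackedReader:
    """Stand-in for the default reader registered for HDF5/netCDF4."""


class ExperimentalHDFReader:
    """Stand-in for a reader the caller passes explicitly."""


# Hypothetical default mapping from file type name to reader class.
DEFAULT_BACKENDS = {
    "hdf5": KerchunkBackedReader,
    "netcdf4": KerchunkBackedReader,
}


def select_backend(
    filetype: Optional[str] = None, backend: Optional[type] = None
) -> type:
    """Pick a reader class, refusing ambiguous filetype + backend input."""
    if backend and filetype:
        raise ValueError("Cannot pass both a filetype and an explicit VirtualBackend")
    if backend:
        # An explicit backend bypasses the file type lookup entirely.
        return backend
    if filetype is None:
        # The real function inspects the file here; this sketch assumes HDF5.
        resolved = FileType.hdf5
    else:
        resolved = FileType(filetype)
    backend_cls = DEFAULT_BACKENDS.get(resolved.name.lower())
    if backend_cls is None:
        raise NotImplementedError(f"Unsupported file type: {resolved.name}")
    return backend_cls
```

Rejecting the combination up front means a caller cannot ask for one file type and silently get a reader meant for another.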
2 changes: 2 additions & 0 deletions virtualizarr/readers/__init__.py
@@ -1,5 +1,6 @@
from virtualizarr.readers.dmrpp import DMRPPVirtualBackend
from virtualizarr.readers.fits import FITSVirtualBackend
from virtualizarr.readers.hdf import HDFVirtualBackend
from virtualizarr.readers.hdf5 import HDF5VirtualBackend
from virtualizarr.readers.kerchunk import KerchunkVirtualBackend
from virtualizarr.readers.netcdf3 import NetCDF3VirtualBackend
@@ -9,6 +10,7 @@
__all__ = [
"DMRPPVirtualBackend",
"FITSVirtualBackend",
"HDFVirtualBackend",
"HDF5VirtualBackend",
"KerchunkVirtualBackend",
"NetCDF3VirtualBackend",
11 changes: 11 additions & 0 deletions virtualizarr/readers/hdf/__init__.py
@@ -0,0 +1,11 @@
from .hdf import (
    HDFVirtualBackend,
    construct_virtual_dataset,
    open_loadable_vars_and_indexes,
)

__all__ = [
    "HDFVirtualBackend",
    "construct_virtual_dataset",
    "open_loadable_vars_and_indexes",
]
195 changes: 195 additions & 0 deletions virtualizarr/readers/hdf/filters.py
@@ -0,0 +1,195 @@
import dataclasses
from typing import TYPE_CHECKING, List, Tuple, TypedDict, Union

import numcodecs.registry as registry
import numpy as np
from numcodecs.abc import Codec
from numcodecs.fixedscaleoffset import FixedScaleOffset
from xarray.coding.variables import _choose_float_dtype

from virtualizarr.utils import soft_import

if TYPE_CHECKING:
    import h5py  # type: ignore
    from h5py import Dataset, Group  # type: ignore

h5py = soft_import("h5py", "For reading hdf files", strict=False)
if h5py:
    Dataset = h5py.Dataset
    Group = h5py.Group
else:
    Dataset = dict()
    Group = dict()

hdf5plugin = soft_import(
    "hdf5plugin", "For reading hdf files with filters", strict=False
)
imagecodecs = soft_import(
    "imagecodecs", "For reading hdf files with filters", strict=False
)

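The body of `soft_import` is not part of this excerpt, so the following is only a guess at the general pattern, built on the standard library's `importlib`. The name `soft_import_sketch` and the error message wording are assumptions; the argument order mirrors the calls above.

```python
import importlib
from types import ModuleType
from typing import Optional


def soft_import_sketch(
    name: str, reason: str, strict: bool = True
) -> Optional[ModuleType]:
    """Hypothetical helper: import a module, or return None when optional."""
    try:
        return importlib.import_module(name)
    except ImportError:
        if strict:
            # In strict mode a missing dependency is an error for the caller.
            raise ImportError(f"{reason}: the package {name!r} is required")
        # In non-strict mode the caller must check for None before use.
        return None


# A standard-library module always imports; a made-up name does not.
json_mod = soft_import_sketch("json", "For parsing JSON", strict=False)
missing = soft_import_sketch("not_a_real_module_xyz", "For illustration", strict=False)
```

With `strict=False` the module-level names are either a module or `None`, which is why the code above branches on `if h5py:` before touching `h5py.Dataset`.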

_non_standard_filters = {
    "gzip": "zlib",
    "lzf": "imagecodecs_lzf",
}

_hdf5plugin_imagecodecs = {"lz4": "imagecodecs_lz4h5", "bzip2": "imagecodecs_bz2"}


@dataclasses.dataclass
class BloscProperties:
    blocksize: int
    clevel: int
    shuffle: int
    cname: str

    def __post_init__(self):
        blosc_compressor_codes = {
            value: key
            for key, value in hdf5plugin._filters.Blosc._Blosc__COMPRESSIONS.items()
        }
        self.cname = blosc_compressor_codes[self.cname]

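To illustrate what `__post_init__` does to `cname`, here is a sketch that runs without hdf5plugin. `ASSUMED_COMPRESSIONS` is a stand-in for the private hdf5plugin table read above and its values are an assumption; `BloscPropsSketch` and the sample `filter_properties` tuple are hypothetical.

```python
import dataclasses

# Assumed compressor name -> integer code table (illustrative values only).
ASSUMED_COMPRESSIONS = {
    "blosclz": 0,
    "lz4": 1,
    "lz4hc": 2,
    "snappy": 3,
    "zlib": 4,
    "zstd": 5,
}


@dataclasses.dataclass
class BloscPropsSketch:
    blocksize: int
    clevel: int
    shuffle: int
    cname: object  # arrives as an integer code, leaves as a string name

    def __post_init__(self):
        # Invert the table so the stored integer code maps back to a name.
        code_to_name = {code: name for name, code in ASSUMED_COMPRESSIONS.items()}
        self.cname = code_to_name[self.cname]


# Hypothetical filter property tuple; only the last four entries are used.
filter_properties = (2, 2, 4, 40000, 5, 1, 1)
field_names = [field.name for field in dataclasses.fields(BloscPropsSketch)]
props = BloscPropsSketch(**dict(zip(field_names, filter_properties[-4:])))

# The resulting dict has the shape of a codec configuration.
conf = dataclasses.asdict(props)
conf["id"] = "blosc"
```

Zipping the dataclass field names against the tail of the tuple keeps the positional meaning of each property in one place, the field order of the dataclass.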

@dataclasses.dataclass
class ZstdProperties:
    level: int


@dataclasses.dataclass
class ShuffleProperties:
    elementsize: int


@dataclasses.dataclass
class ZlibProperties:
    level: int


class CFCodec(TypedDict):
    target_dtype: np.dtype
    codec: Codec


def _filter_to_codec(
    filter_id: str, filter_properties: Union[int, None, Tuple] = None
) -> Codec:
    """
    Convert an h5py filter to an equivalent numcodec
    Parameters
    ----------
    filter_id: str
        An h5py filter id code.
    filter_properties : int or None or Tuple
        A single or Tuple of h5py filter configuration codes.
    Returns
    -------
    A numcodec codec
    """
    id_int = None
    id_str = None
    try:
        id_int = int(filter_id)
    except ValueError:
        id_str = filter_id
    conf = {}
    if id_str:
        if id_str in _non_standard_filters.keys():
            id = _non_standard_filters[id_str]
        else:
            id = id_str
        if id == "zlib":
            zlib_props = ZlibProperties(level=filter_properties)  # type: ignore
            conf = dataclasses.asdict(zlib_props)
        if id == "shuffle" and isinstance(filter_properties, tuple):
            shuffle_props = ShuffleProperties(elementsize=filter_properties[0])
            conf = dataclasses.asdict(shuffle_props)
        conf["id"] = id  # type: ignore[assignment]
    if id_int:
        filter = hdf5plugin.get_filters(id_int)[0]
        id = filter.filter_name
        if id in _hdf5plugin_imagecodecs.keys():
            id = _hdf5plugin_imagecodecs[id]
        if id == "blosc" and isinstance(filter_properties, tuple):
            blosc_fields = [field.name for field in dataclasses.fields(BloscProperties)]
            blosc_props = BloscProperties(
                **{k: v for k, v in zip(blosc_fields, filter_properties[-4:])}
            )
            conf = dataclasses.asdict(blosc_props)
        if id == "zstd" and isinstance(filter_properties, tuple):
            zstd_props = ZstdProperties(level=filter_properties[0])
            conf = dataclasses.asdict(zstd_props)
        conf["id"] = id
    codec = registry.get_codec(conf)
    return codec

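The dispatch in `_filter_to_codec` hinges on whether the filter id parses as an integer. A simplified sketch of that branching follows, producing only the configuration dict and stopping before any numcodecs registry lookup. `filter_conf` is a hypothetical name, and `ASSUMED_FILTER_NAMES` is an assumed id-to-name table standing in for the hdf5plugin lookup, which this sketch does not call.

```python
from typing import Optional, Tuple, Union

NON_STANDARD = {"gzip": "zlib", "lzf": "imagecodecs_lzf"}
IMAGECODECS_ALIASES = {"lz4": "imagecodecs_lz4h5", "bzip2": "imagecodecs_bz2"}
# Assumed numeric HDF5 filter ids -> names (illustrative, not queried live).
ASSUMED_FILTER_NAMES = {32015: "zstd", 32004: "lz4", 307: "bzip2"}


def filter_conf(
    filter_id: str, filter_properties: Union[int, None, Tuple] = None
) -> dict:
    """Build a codec config dict from a filter id and its properties."""
    conf: dict = {}
    try:
        numeric_id: Optional[int] = int(filter_id)
    except ValueError:
        numeric_id = None
    if numeric_id is None:
        # Named filters: translate HDF5 names to codec ids where they differ.
        name = NON_STANDARD.get(filter_id, filter_id)
        if name == "zlib":
            conf["level"] = filter_properties
        if name == "shuffle" and isinstance(filter_properties, tuple):
            conf["elementsize"] = filter_properties[0]
    else:
        # Numeric filters: resolve the name, then apply any alias.
        name = ASSUMED_FILTER_NAMES[numeric_id]
        name = IMAGECODECS_ALIASES.get(name, name)
        if name == "zstd" and isinstance(filter_properties, tuple):
            conf["level"] = filter_properties[0]
    conf["id"] = name
    return conf
```

Unlike the function above, this sketch tests `numeric_id is None` instead of relying on truthiness, a deliberate simplification so that the two branches are mutually exclusive.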

def cfcodec_from_dataset(dataset: Dataset) -> Codec | None:
    """
    Converts select h5py dataset CF convention attrs to CFCodec
    Parameters
    ----------
    dataset: h5py.Dataset
        An h5py dataset.
    Returns
    -------
    CFCodec
        A CFCodec.
    """
    attributes = {attr: dataset.attrs[attr] for attr in dataset.attrs}
    mapping = {}
    if "scale_factor" in attributes:
        try:
            scale_factor = attributes["scale_factor"][0]
        except IndexError:
            scale_factor = attributes["scale_factor"]
        mapping["scale_factor"] = float(1 / scale_factor)
    else:
        mapping["scale_factor"] = 1
    if "add_offset" in attributes:
        try:
            offset = attributes["add_offset"][0]
        except IndexError:
            offset = attributes["add_offset"]
        mapping["add_offset"] = float(offset)
    else:
        mapping["add_offset"] = 0
    if mapping["scale_factor"] != 1 or mapping["add_offset"] != 0:
        float_dtype = _choose_float_dtype(dtype=dataset.dtype, mapping=mapping)
        target_dtype = np.dtype(float_dtype)
        codec = FixedScaleOffset(
            offset=mapping["add_offset"],
            scale=mapping["scale_factor"],
            dtype=target_dtype,
            astype=dataset.dtype,
        )
        cfcodec = CFCodec(target_dtype=target_dtype, codec=codec)
        return cfcodec
    else:
        return None

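The `1 / scale_factor` above is easy to misread. CF conventions unpack a stored value as `packed * scale_factor + add_offset`, while a fixed scale/offset codec decodes as `encoded / scale + offset`, so the codec's `scale` has to be the reciprocal of the CF attribute. The arithmetic can be checked with plain Python; `cf_unpack` and `fixed_scale_offset_decode` are hypothetical helper names and the sample values are made up.

```python
import math


def cf_unpack(packed: int, scale_factor: float, add_offset: float) -> float:
    """CF convention: stored integers are scaled, then shifted."""
    return packed * scale_factor + add_offset


def fixed_scale_offset_decode(encoded: int, scale: float, offset: float) -> float:
    """Decode step of a fixed scale/offset codec: divide, then shift."""
    return encoded / scale + offset


scale_factor, add_offset = 0.5, 10.0
codec_scale = 1 / scale_factor  # the reciprocal taken in the function above

packed_values = [0, 1, 7, 100]
cf_values = [cf_unpack(p, scale_factor, add_offset) for p in packed_values]
codec_values = [
    fixed_scale_offset_decode(p, codec_scale, add_offset) for p in packed_values
]
```

Both lists agree, which is the property the reader depends on when it expresses CF packing as a codec at the end of the filter chain.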

def codecs_from_dataset(dataset: Dataset) -> List[Codec]:
    """
    Extracts a list of numcodecs from an h5py dataset
    Parameters
    ----------
    dataset: h5py.Dataset
        An h5py dataset.
    Returns
    -------
    list
        A list of numcodecs codecs.
    """
    codecs = []
    for filter_id, filter_properties in dataset._filters.items():
        codec = _filter_to_codec(filter_id, filter_properties)
        codecs.append(codec)
    return codecs