Non-kerchunk backend for HDF5/netcdf4 files. #87

Merged
merged 106 commits into from
Nov 19, 2024
6b7abe2
Generate chunk manifest backed variable from HDF5 dataset.
sharkinsspatial Apr 19, 2024
bca0aab
Transfer dataset attrs to variable.
sharkinsspatial Apr 19, 2024
384ff6b
Get virtual variables dict from HDF5 file.
sharkinsspatial Apr 19, 2024
4c5f9bd
Update virtual_vars_from_hdf to use fsspec and drop_variables arg.
sharkinsspatial Apr 22, 2024
1dd3370
mypy fix to use ChunkKey and empty dimensions list.
sharkinsspatial Apr 22, 2024
d92c75c
Extract attributes from hdf5 root group.
sharkinsspatial Apr 22, 2024
0ed8362
Use hdf reader for netcdf4 files.
sharkinsspatial Apr 22, 2024
f4485fa
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 22, 2024
3cc1254
Merge branch 'main' into hdf5_reader
sharkinsspatial May 8, 2024
0123df7
Fix ruff complaints.
sharkinsspatial May 9, 2024
332bcaa
First steps for handling HDF5 filters.
sharkinsspatial May 10, 2024
c51e615
Initial step for hdf5plugin supported codecs.
sharkinsspatial May 13, 2024
0083f77
Small commit to check compression support in CI environment.
sharkinsspatial May 16, 2024
3c00071
Merge branch 'main' into hdf5_reader
sharkinsspatial May 18, 2024
207c4b5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 19, 2024
c573800
Fix mypy complaints for hdf_filters.
sharkinsspatial May 19, 2024
ef0d7a8
Merge branch 'hdf5_reader' of https://github.com/TomNicholas/Virtuali…
sharkinsspatial May 19, 2024
588e06b
Local pre-commit fix for hdf_filters.
sharkinsspatial May 19, 2024
725333e
Use fsspec reader_options introduced in #37.
sharkinsspatial May 21, 2024
72df108
Fix incorrect zarr_v3 if block position from merge commit ef0d7a8.
sharkinsspatial May 21, 2024
d1e85cb
Fix early return from hdf _extract_attrs.
sharkinsspatial May 21, 2024
1e2b343
Test that _extract_attrs correctly handles multiple attributes.
sharkinsspatial May 21, 2024
7f1c189
Initial attempt at scale and offset via numcodecs.
sharkinsspatial May 22, 2024
908e332
Tests for cfcodec_from_dataset.
sharkinsspatial May 23, 2024
0df332d
Temporarily relax integration tests to assert_allclose.
sharkinsspatial May 24, 2024
ca6b236
Add blosc_lz4 fixture parameterization to confirm libnetcdf environment.
sharkinsspatial May 24, 2024
b7426c5
Check for compatibility with netcdf4 engine.
sharkinsspatial May 24, 2024
dac21dd
Use separate fixtures for h5netcdf and netcdf4 compression styles.
sharkinsspatial May 27, 2024
e968772
Print libhdf5 and libnetcdf4 versions to confirm compiled environment.
sharkinsspatial May 27, 2024
9a98e57
Skip netcdf4 style compression tests when libhdf5 < 1.14.
sharkinsspatial May 27, 2024
7590b87
Include imagecodecs.numcodecs to support HDF5 lzf filters.
sharkinsspatial Jun 11, 2024
e9fbc8a
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 11, 2024
14bd709
Remove test that verifies call to read_kerchunk_references_from_file.
sharkinsspatial Jun 11, 2024
acdf0d7
Add additional codec support structures for imagecodecs and numcodecs.
sharkinsspatial Jun 12, 2024
4ba323a
Add codec config test for Zstd.
sharkinsspatial Jun 12, 2024
e14e53b
Include initial cf decoding tests.
sharkinsspatial Jun 21, 2024
b808ded
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 21, 2024
b052f8c
Revert typo for scale_factor retrieval.
sharkinsspatial Jun 21, 2024
01a3980
Update reader to use new numpy manifest representation.
sharkinsspatial Jun 21, 2024
c37d9e5
Temporarily skip test until blosc netcdf4 issue is solved.
sharkinsspatial Jun 22, 2024
17b30d4
Fix Pydantic 2 migration warnings.
sharkinsspatial Jun 22, 2024
f6b596a
Include hdf5plugin and imagecodecs-numcodecs in mamba test environment.
sharkinsspatial Jun 22, 2024
eb6e24d
Mamba attempt with imagecodecs rather than imagecodecs-numcodecs.
sharkinsspatial Jun 22, 2024
c85bd16
Mamba attempt with latest imagecodecs release.
sharkinsspatial Jun 22, 2024
ca435da
Use correct iter_chunks callback function signature.
sharkinsspatial Jun 26, 2024
3017951
Include pip based imagecodecs-numcodecs until conda-forge availability.
sharkinsspatial Jun 26, 2024
ccf0b73
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 26, 2024
32ba135
Handle non-coordinate dims which are serialized to hdf as empty dataset.
sharkinsspatial Jun 27, 2024
64f446c
Use reader_options for filetype check and update failing kerchunk call.
sharkinsspatial Jun 27, 2024
1c590bb
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 27, 2024
9797346
Fix chunkmanifest shaping for chunked datasets.
sharkinsspatial Jun 30, 2024
c833e19
Handle scale_factor attribute serialization for compressed files.
sharkinsspatial Jun 30, 2024
701bcfa
Include chunked roundtrip fixture.
sharkinsspatial Jun 30, 2024
08c988e
Standardize xarray integration tests for hdf filters.
sharkinsspatial Jun 30, 2024
e6076bd
Merge branch 'hdf5_reader' of https://github.com/TomNicholas/Virtuali…
sharkinsspatial Jun 30, 2024
d684a84
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 30, 2024
4cb4bac
Update reader selection logic for new filetype determination.
sharkinsspatial Jun 30, 2024
d352104
Use decode_times for integration test.
sharkinsspatial Jun 30, 2024
3d89ea4
Standardize fixture names for hdf5 vs netcdf4 file types.
sharkinsspatial Jun 30, 2024
c9dd0d9
Handle array add_offset property for compressed data.
sharkinsspatial Jul 1, 2024
db5b421
Include h5py shuffle filter.
sharkinsspatial Jul 1, 2024
9a1da32
Make ScaleAndOffset codec last in filters list.
sharkinsspatial Jul 1, 2024
9b2b0f8
Apply ScaleAndOffset codec to _FillValue since its value is now down…
sharkinsspatial Jul 2, 2024
9ef1362
Coerce scale and add_offset values to native float for JSON serializa…
sharkinsspatial Jul 2, 2024
30005bd
Merge branch 'main' into hdf5_reader
sharkinsspatial Aug 6, 2024
14f7a99
Merge branch 'main' into hdf5_reader
sharkinsspatial Aug 6, 2024
f4f9c8f
Temporarily xfail integration tests for main
sharkinsspatial Aug 9, 2024
d257cb9
Merge branch 'main' into hdf5_reader
sharkinsspatial Oct 2, 2024
e795c2c
Merge branch 'main' into hdf5_reader
sharkinsspatial Oct 8, 2024
a9e59f2
Remove pydantic dependency as per pull/210.
sharkinsspatial Oct 8, 2024
2b33bc2
Update test for new kerchunk reader module location.
sharkinsspatial Oct 8, 2024
a57ae9e
Fix branch typing errors.
sharkinsspatial Oct 9, 2024
e21fc69
Re-include automatic file type determination.
sharkinsspatial Oct 9, 2024
df69a12
Handle various hdf flavors of _FillValue storage.
sharkinsspatial Oct 9, 2024
169337c
Include loadable variables in drop variables list.
sharkinsspatial Oct 9, 2024
bdcbfbf
Mock readers.hdf.virtual_vars_from_hdf to verify option passing.
sharkinsspatial Oct 9, 2024
77f1689
Convert numpy _FillValue to native Python for serialization support.
sharkinsspatial Oct 9, 2024
42c653a
Support groups with HDF5 reader.
sharkinsspatial Oct 10, 2024
9c86e0d
Handle empty variables with a shape.
sharkinsspatial Oct 17, 2024
001a4a7
Merge branch 'main' into hdf5_reader
sharkinsspatial Oct 23, 2024
79f9921
Merge branch 'main' into hdf5_reader
sharkinsspatial Oct 23, 2024
1589776
Import top-level version of xarray classes.
sharkinsspatial Oct 23, 2024
772c580
Add option to explicitly specify use of an experimental hdf backend.
sharkinsspatial Oct 24, 2024
3ab90c6
Include imagecodecs and hdf5plugin in all CI environments.
sharkinsspatial Oct 24, 2024
150d06d
Add test_hdf_integration tests to be skipped for non-kerchunk env.
sharkinsspatial Oct 24, 2024
8ccba34
Include imagecodecs in dependencies.
sharkinsspatial Oct 24, 2024
81874e0
Diagnose imagecodecs-numcodecs installation failures in CI.
sharkinsspatial Oct 24, 2024
f87abe2
Ignore mypy complaints for VirtualBackend.
sharkinsspatial Oct 24, 2024
70e7e29
Remove checksum assert which varies across different zstd versions.
sharkinsspatial Oct 24, 2024
43bc0e4
Temporarily xfail integration tests with coordinate inconsistency.
sharkinsspatial Oct 24, 2024
82a6321
Remove backend arg for non-hdf network file tests.
sharkinsspatial Oct 24, 2024
b34f260
Fix mypy comment moved by ruff formatting.
sharkinsspatial Oct 24, 2024
f9ead06
Make HDF reader dependencies optional.
sharkinsspatial Oct 25, 2024
5608292
Handle optional imagecodecs and hdf5plugin dependency imports for tests.
sharkinsspatial Oct 25, 2024
2fa548c
Prevent conflicts with explicit filetype and backend args.
sharkinsspatial Nov 11, 2024
bc0d925
Correctly convert root coordinate attributes to a list.
sharkinsspatial Nov 13, 2024
783df94
Clarify that method extracts attrs from any specified group.
sharkinsspatial Nov 14, 2024
16f288b
Restructure hdf reader and codec filters into a module namespace.
sharkinsspatial Nov 14, 2024
3e216dc
Improve docstrings for hdf and filter modules.
sharkinsspatial Nov 14, 2024
5b085a6
Explicitly specify HDF5VirtualBackend for test parameter.
sharkinsspatial Nov 14, 2024
83ff577
Include issue references for xfailed tests.
sharkinsspatial Nov 15, 2024
ee6fa0b
Use soft import strategy for optional dependencies see xarray/issues/…
sharkinsspatial Nov 18, 2024
44bce08
Merge branch 'main' into hdf5_reader
sharkinsspatial Nov 18, 2024
5de9d2c
Handle mypy for soft imports.
sharkinsspatial Nov 18, 2024
a8cc82f
Attempt at nested optional dependency usage.
sharkinsspatial Nov 18, 2024
65a6b14
Handle use of soft import sub modules for typing.
sharkinsspatial Nov 18, 2024
6 changes: 6 additions & 0 deletions ci/environment.yml
@@ -13,6 +13,8 @@ dependencies:
  - ujson
  - packaging
  - universal_pathlib
  - hdf5plugin
  - numcodecs
  # Testing
  - codecov
  - pre-commit
@@ -27,7 +29,11 @@ dependencies:
  - fsspec
  - s3fs
  - fastparquet
  - imagecodecs>=2024.6.1
  # for opening tiff files
  - tifffile
  # for opening FITS files
  - astropy
  - pip
  - pip:
    - imagecodecs-numcodecs==2024.6.1
1 change: 0 additions & 1 deletion ci/min-deps.yml
@@ -3,7 +3,6 @@ channels:
  - conda-forge
  - nodefaults
dependencies:
  - h5netcdf
  - h5py
  - hdf5
  - netcdf4
4 changes: 4 additions & 0 deletions ci/upstream.yml
@@ -12,6 +12,9 @@ dependencies:
  - packaging
  - ujson
  - universal_pathlib
  - hdf5plugin
  - numcodecs
  - imagecodecs>=2024.6.1
  # Testing
  - codecov
  - pre-commit
@@ -27,3 +30,4 @@ dependencies:
  - pip:
    - icechunk # Installs zarr v3 as dependency
    # - git+https://github.com/fsspec/kerchunk@main # kerchunk is currently incompatible with zarr-python v3 (https://github.com/fsspec/kerchunk/pull/516)
    - imagecodecs-numcodecs==2024.6.1
11 changes: 10 additions & 1 deletion pyproject.toml
@@ -30,15 +30,23 @@ dependencies = [
]

[project.optional-dependencies]
hdf_reader = [
    "fsspec",
    "h5py",
    "hdf5plugin",
    "imagecodecs",
    "imagecodecs-numcodecs==2024.6.1",
    "numcodecs"
]
test = [
    "codecov",
    "fastparquet",
    "fsspec",
    "h5netcdf",
    "h5py",
    "kerchunk>=0.2.5",
    "mypy",
    "netcdf4",
    "numcodecs",
    "pandas-stubs",
    "pooch",
    "pre-commit",
@@ -48,6 +56,7 @@ test = [
    "ruff",
    "s3fs",
    "scipy",
    "virtualizarr[hdf_reader]"
]
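
Because the `hdf_reader` extra makes `h5py`, `hdf5plugin` and `imagecodecs` optional, modules that use them have to tolerate their absence at import time. Later in this diff, `filters.py` does so through calls such as `soft_import("h5py", "For reading hdf files", strict=False)`. The body of that helper is not part of this diff, so the following is only a minimal sketch of what such a helper might look like, assuming it wraps `importlib.import_module`; the error message and the default value of `strict` are assumptions.

```python
import importlib
from types import ModuleType
from typing import Optional


def soft_import(name: str, reason: str, strict: bool = True) -> Optional[ModuleType]:
    """Return the imported module, or None when it is missing and strict is False.

    Hypothetical stand-in for virtualizarr.utils.soft_import, whose real body
    is not shown in this diff.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        if strict:
            raise ImportError(
                f"for {reason}, the {name} package is required; "
                "install it via pip or conda"
            )
        return None


# A module that exists is returned as usual.
json_mod = soft_import("json", "demonstrating a present module", strict=False)

# A module that does not exist yields None instead of raising.
missing = soft_import("definitely_not_installed_pkg_xyz", "demonstration", strict=False)
```

Callers can then branch on the returned value (`if h5py: ...`), which is the pattern the `filters.py` module in this PR uses.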


16 changes: 12 additions & 4 deletions virtualizarr/backend.py
@@ -16,8 +16,10 @@
    HDF5VirtualBackend,
    KerchunkVirtualBackend,
    NetCDF3VirtualBackend,
    TIFFVirtualBackend,
    ZarrV3VirtualBackend,
)
from virtualizarr.readers.common import VirtualBackend
from virtualizarr.utils import _FsspecFSFromFilepath, check_for_collisions

# TODO add entrypoint to allow external libraries to add to this mapping
@@ -26,10 +28,10 @@
    "zarr_v3": ZarrV3VirtualBackend,
    "dmrpp": DMRPPVirtualBackend,
    # all the below call one of the kerchunk backends internally (https://fsspec.github.io/kerchunk/reference.html#file-format-backends)
    "netcdf3": NetCDF3VirtualBackend,
    "hdf5": HDF5VirtualBackend,
    "netcdf4": HDF5VirtualBackend,  # note this is the same as for hdf5
    # "tiff": TIFFVirtualBackend,
    "netcdf3": NetCDF3VirtualBackend,
    "tiff": TIFFVirtualBackend,
    "fits": FITSVirtualBackend,
}

@@ -112,6 +114,7 @@ def open_virtual_dataset(
    indexes: Mapping[str, Index] | None = None,
    virtual_array_class=ManifestArray,
    reader_options: Optional[dict] = None,
    backend: Optional[VirtualBackend] = None,
) -> Dataset:
    """
    Open a file or store as an xarray Dataset wrapping virtualized zarr arrays.
@@ -173,15 +176,20 @@ def open_virtual_dataset(
    if reader_options is None:
        reader_options = {}

    if backend and filetype:
        raise ValueError("Cannot pass both a filetype and an explicit VirtualBackend")

    if filetype is not None:
        # if filetype is user defined, convert to FileType
        filetype = FileType(filetype)
    else:
        filetype = automatically_determine_filetype(
            filepath=filepath, reader_options=reader_options
        )

    backend_cls = VIRTUAL_BACKENDS.get(filetype.name.lower())
    if backend:
        backend_cls = backend
    else:
        backend_cls = VIRTUAL_BACKENDS.get(filetype.name.lower())  # type: ignore

    if backend_cls is None:
        raise NotImplementedError(f"Unsupported file type: {filetype.name}")
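
The selection rule in this hunk can be illustrated in isolation. The sketch below is not the project's code: `FileType`, `REGISTRY`, `KerchunkBackedReader`, `ExperimentalHDFReader` and `select_backend` are hypothetical stand-ins, and, unlike the hunk above, the sketch returns as soon as an explicit backend is given and has no automatic filetype detection.

```python
from enum import Enum
from typing import Dict, Optional


class FileType(Enum):
    # Hypothetical subset of file types, for illustration only.
    hdf5 = "hdf5"
    netcdf4 = "netcdf4"
    netcdf3 = "netcdf3"


class KerchunkBackedReader:
    """Stand-in for the reader class registered by default."""


class ExperimentalHDFReader:
    """Stand-in for a backend class passed explicitly by the caller."""


# Hypothetical registry; netcdf3 is deliberately left out to show the failure path.
REGISTRY: Dict[str, type] = {
    "hdf5": KerchunkBackedReader,
    "netcdf4": KerchunkBackedReader,
}


def select_backend(filetype: Optional[str] = None, backend: Optional[type] = None) -> type:
    # Passing both is ambiguous, so it is rejected up front.
    if backend and filetype:
        raise ValueError("Cannot pass both a filetype and an explicit VirtualBackend")
    # An explicit backend class overrides the registry lookup.
    if backend:
        return backend
    if filetype is None:
        raise ValueError("this sketch has no automatic filetype detection")
    backend_cls = REGISTRY.get(FileType(filetype).name.lower())
    if backend_cls is None:
        raise NotImplementedError(f"Unsupported file type: {filetype}")
    return backend_cls
```

With these stand-ins, `select_backend(filetype="hdf5")` resolves through the registry, while `select_backend(backend=ExperimentalHDFReader)` bypasses it, which is how the new `backend` argument lets a caller opt in to the non-kerchunk reader.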
2 changes: 2 additions & 0 deletions virtualizarr/readers/__init__.py
@@ -1,5 +1,6 @@
from virtualizarr.readers.dmrpp import DMRPPVirtualBackend
from virtualizarr.readers.fits import FITSVirtualBackend
from virtualizarr.readers.hdf import HDFVirtualBackend
from virtualizarr.readers.hdf5 import HDF5VirtualBackend
from virtualizarr.readers.kerchunk import KerchunkVirtualBackend
from virtualizarr.readers.netcdf3 import NetCDF3VirtualBackend
@@ -9,6 +10,7 @@
__all__ = [
    "DMRPPVirtualBackend",
    "FITSVirtualBackend",
    "HDFVirtualBackend",
    "HDF5VirtualBackend",
    "KerchunkVirtualBackend",
    "NetCDF3VirtualBackend",
11 changes: 11 additions & 0 deletions virtualizarr/readers/hdf/__init__.py
@@ -0,0 +1,11 @@
from .hdf import (
    HDFVirtualBackend,
    construct_virtual_dataset,
    open_loadable_vars_and_indexes,
)

__all__ = [
    "HDFVirtualBackend",
    "construct_virtual_dataset",
    "open_loadable_vars_and_indexes",
]
195 changes: 195 additions & 0 deletions virtualizarr/readers/hdf/filters.py
@@ -0,0 +1,195 @@
import dataclasses
from typing import TYPE_CHECKING, List, Tuple, TypedDict, Union

import numcodecs.registry as registry
import numpy as np
from numcodecs.abc import Codec
from numcodecs.fixedscaleoffset import FixedScaleOffset
from xarray.coding.variables import _choose_float_dtype

from virtualizarr.utils import soft_import

if TYPE_CHECKING:
    import h5py  # type: ignore
    from h5py import Dataset, Group  # type: ignore

h5py = soft_import("h5py", "For reading hdf files", strict=False)
if h5py:
    Dataset = h5py.Dataset
    Group = h5py.Group
else:
    Dataset = dict()
    Group = dict()

hdf5plugin = soft_import(
    "hdf5plugin", "For reading hdf files with filters", strict=False
)
imagecodecs = soft_import(
    "imagecodecs", "For reading hdf files with filters", strict=False
)


_non_standard_filters = {
    "gzip": "zlib",
    "lzf": "imagecodecs_lzf",
}

_hdf5plugin_imagecodecs = {"lz4": "imagecodecs_lz4h5", "bzip2": "imagecodecs_bz2"}


@dataclasses.dataclass
class BloscProperties:
    blocksize: int
    clevel: int
    shuffle: int
    cname: str

    def __post_init__(self):
        blosc_compressor_codes = {
            value: key
            for key, value in hdf5plugin._filters.Blosc._Blosc__COMPRESSIONS.items()
        }
        self.cname = blosc_compressor_codes[self.cname]


@dataclasses.dataclass
class ZstdProperties:
    level: int


@dataclasses.dataclass
class ShuffleProperties:
    elementsize: int


@dataclasses.dataclass
class ZlibProperties:
    level: int


class CFCodec(TypedDict):
    target_dtype: np.dtype
    codec: Codec


def _filter_to_codec(
    filter_id: str, filter_properties: Union[int, None, Tuple] = None
) -> Codec:
    """
    Convert an h5py filter to an equivalent numcodec

    Parameters
    ----------
    filter_id: str
        An h5py filter id code.
    filter_properties : int or None or Tuple
        A single or Tuple of h5py filter configuration codes.

    Returns
    -------
    A numcodec codec
    """
    id_int = None
    id_str = None
    try:
        id_int = int(filter_id)
    except ValueError:
        id_str = filter_id
    conf = {}
    if id_str:
        if id_str in _non_standard_filters.keys():
            id = _non_standard_filters[id_str]
        else:
            id = id_str
        if id == "zlib":
            zlib_props = ZlibProperties(level=filter_properties)  # type: ignore
            conf = dataclasses.asdict(zlib_props)
        if id == "shuffle" and isinstance(filter_properties, tuple):
            shuffle_props = ShuffleProperties(elementsize=filter_properties[0])
            conf = dataclasses.asdict(shuffle_props)
        conf["id"] = id  # type: ignore[assignment]
    if id_int:
        filter = hdf5plugin.get_filters(id_int)[0]
        id = filter.filter_name
        if id in _hdf5plugin_imagecodecs.keys():
            id = _hdf5plugin_imagecodecs[id]
        if id == "blosc" and isinstance(filter_properties, tuple):
            blosc_fields = [field.name for field in dataclasses.fields(BloscProperties)]
            blosc_props = BloscProperties(
                **{k: v for k, v in zip(blosc_fields, filter_properties[-4:])}
            )
            conf = dataclasses.asdict(blosc_props)
        if id == "zstd" and isinstance(filter_properties, tuple):
            zstd_props = ZstdProperties(level=filter_properties[0])
            conf = dataclasses.asdict(zstd_props)
        conf["id"] = id
    codec = registry.get_codec(conf)
    return codec
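
Review note: the dispatch in `_filter_to_codec` can be followed without h5py, hdf5plugin or numcodecs installed. The sketch below rebuilds only the configuration dictionary that would be handed to `registry.get_codec`. `filter_to_config`, `FILTER_NAMES` and `BLOSC_COMPRESSORS` are hypothetical names; the two static tables stand in for the lookups the real function performs through hdf5plugin and are assumptions (32001 and 32015 are the registered HDF5 filter numbers for Blosc and Zstandard).

```python
from typing import Any, Dict, Tuple, Union

# Copied from _non_standard_filters in the diff above.
NAME_ALIASES = {"gzip": "zlib", "lzf": "imagecodecs_lzf"}

# Assumption: static replacement for hdf5plugin.get_filters(id)[0].filter_name.
FILTER_NAMES = {32001: "blosc", 32015: "zstd"}

# Assumption: Blosc compressor codes as used by the HDF5 Blosc filter.
BLOSC_COMPRESSORS = {0: "blosclz", 1: "lz4", 2: "lz4hc", 3: "snappy", 4: "zlib", 5: "zstd"}


def filter_to_config(
    filter_id: str, props: Union[int, Tuple[int, ...], None] = None
) -> Dict[str, Any]:
    """Build a numcodecs-style config dict from an h5py filter id and its properties."""
    try:
        numeric_id = int(filter_id)
    except ValueError:
        numeric_id = None

    if numeric_id is None:
        # Named filters (gzip, shuffle, lzf): translate the name, then attach options.
        name = NAME_ALIASES.get(filter_id, filter_id)
        conf: Dict[str, Any] = {"id": name}
        if name == "zlib":
            conf["level"] = props
        elif name == "shuffle" and isinstance(props, tuple):
            conf["elementsize"] = props[0]
        return conf

    # Numeric ids: resolve the filter name, then decode its property tuple.
    name = FILTER_NAMES[numeric_id]
    conf = {"id": name}
    if name == "blosc" and isinstance(props, tuple):
        # Only the last four properties are used, as in BloscProperties.
        blocksize, clevel, shuffle, code = props[-4:]
        conf.update(
            blocksize=blocksize,
            clevel=clevel,
            shuffle=shuffle,
            cname=BLOSC_COMPRESSORS[code],
        )
    elif name == "zstd" and isinstance(props, tuple):
        conf["level"] = props[0]
    return conf


gzip_conf = filter_to_config("gzip", 4)
blosc_conf = filter_to_config("32001", (2, 2, 8, 8000, 5, 1, 5))
```

Here `gzip_conf` carries the id `zlib` with level 4, and `blosc_conf` takes its block size, compression level, shuffle flag and compressor name from the tail of the property tuple.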


def cfcodec_from_dataset(dataset: Dataset) -> Codec | None:
    """
    Converts select h5py dataset CF convention attrs to CFCodec

    Parameters
    ----------
    dataset: h5py.Dataset
        An h5py dataset.

    Returns
    -------
    CFCodec
        A CFCodec.
    """
    attributes = {attr: dataset.attrs[attr] for attr in dataset.attrs}
    mapping = {}
    if "scale_factor" in attributes:
        try:
            scale_factor = attributes["scale_factor"][0]
        except IndexError:
            scale_factor = attributes["scale_factor"]
        mapping["scale_factor"] = float(1 / scale_factor)
    else:
        mapping["scale_factor"] = 1
    if "add_offset" in attributes:
        try:
            offset = attributes["add_offset"][0]
        except IndexError:
            offset = attributes["add_offset"]
        mapping["add_offset"] = float(offset)
    else:
        mapping["add_offset"] = 0
    if mapping["scale_factor"] != 1 or mapping["add_offset"] != 0:
        float_dtype = _choose_float_dtype(dtype=dataset.dtype, mapping=mapping)
        target_dtype = np.dtype(float_dtype)
        codec = FixedScaleOffset(
            offset=mapping["add_offset"],
            scale=mapping["scale_factor"],
            dtype=target_dtype,
            astype=dataset.dtype,
        )
        cfcodec = CFCodec(target_dtype=target_dtype, codec=codec)
        return cfcodec
    else:
        return None
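
Review note: the `1 / scale_factor` inversion above follows from the two conventions involved. CF unpacking computes `packed * scale_factor + add_offset`, while numcodecs' `FixedScaleOffset` decodes as `stored / scale + offset`, so the codec's `scale` has to be the reciprocal of the CF `scale_factor`. The scalar sketch below checks this arithmetic with plain Python; `ScaleOffsetSketch` and `cf_unpack` are hypothetical names that only mimic the codec's arithmetic and do not call numcodecs.

```python
def cf_unpack(packed: int, scale_factor: float, add_offset: float) -> float:
    """CF convention: unpacked = packed * scale_factor + add_offset."""
    return packed * scale_factor + add_offset


class ScaleOffsetSketch:
    """Scalar imitation of the FixedScaleOffset encode/decode arithmetic."""

    def __init__(self, offset: float, scale: float) -> None:
        self.offset = offset
        self.scale = scale

    def encode(self, value: float) -> int:
        # Stored integer: round((value - offset) * scale)
        return round((value - self.offset) * self.scale)

    def decode(self, stored: int) -> float:
        # Recovered value: stored / scale + offset
        return stored / self.scale + self.offset


# Example CF attributes, chosen so the floating-point results are exact.
scale_factor = 0.5
add_offset = 10.0

# The codec receives the reciprocal of the CF scale_factor.
codec = ScaleOffsetSketch(offset=add_offset, scale=1 / scale_factor)

packed_value = 7
unpacked_cf = cf_unpack(packed_value, scale_factor, add_offset)
unpacked_codec = codec.decode(packed_value)
repacked = codec.encode(unpacked_codec)
```

Both unpacking routes give the same value for the stored integer, and encoding that value returns the original integer.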


def codecs_from_dataset(dataset: Dataset) -> List[Codec]:
    """
    Extracts a list of numcodecs from an h5py dataset

    Parameters
    ----------
    dataset: h5py.Dataset
        An h5py dataset.

    Returns
    -------
    list
        A list of numcodecs codecs.
    """
    codecs = []
    for filter_id, filter_properties in dataset._filters.items():
        codec = _filter_to_codec(filter_id, filter_properties)
        codecs.append(codec)
    return codecs