Aspirational use case: [C]Worthy mCDR OAE Atlas dataset #132

Open
15 of 21 tasks
TomNicholas opened this issue Jun 8, 2024 · 0 comments
Labels: documentation (Improvements or additions to documentation), usage example (Real world use case examples)

TomNicholas commented Jun 8, 2024

The reason I made this package is to handle one particularly challenging use case - the [C]Worthy mCDR Atlas - which I still haven't done. Once it's done I plan to write a blog post talking about it, and maybe add it as a usage example to this repository.

This dataset has some characteristics that make it really challenging to kerchunk/virtualize [1]:

  • It's ~50TB compressed on-disk,
  • It has ~500,000 netCDF files(!), each with about 40 variables,
  • The largest variables are 3-dimensional, and require concatenation along an additional 3 dimensions, so the resulting variables are 6-dimensional,
  • It requires merging in lower-dimensional variables too, not just concatenation,
  • It has time encoding on some coordinates.
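As a rough illustration of the concatenation problem described above, here is a minimal pure-Python sketch of arranging a flat collection of files into the kind of nested list that a multi-dimensional combine (for example `xarray.combine_nested`) expects, with one nesting level per new dimension. The dimension names, sizes, per-file shape, and file names are all hypothetical and chosen only to keep the example small; they are not taken from the Atlas itself.

```python
from itertools import product

# Hypothetical sizes for the three extra concatenation dimensions.
extra_dims = {"dim_a": 2, "dim_b": 4, "dim_c": 3}
# Hypothetical shape of one 3-dimensional variable inside a single file.
per_file_shape = (60, 384, 320)


def nest_paths(paths_by_index, sizes):
    """Arrange a flat {index tuple: path} mapping into nested lists,
    one nesting level per concatenation dimension."""
    def build(prefix, remaining):
        if not remaining:
            return paths_by_index[prefix]
        return [build(prefix + (i,), remaining[1:]) for i in range(remaining[0])]

    return build((), tuple(sizes))


sizes = tuple(extra_dims.values())

# Hypothetical file naming scheme: one file per point on the 3-D grid.
paths_by_index = {
    idx: "file_{}_{}_{}.nc".format(*idx)
    for idx in product(*(range(n) for n in sizes))
}

nested = nest_paths(paths_by_index, sizes)

# Concatenating along three new dimensions prepends them to the per-file shape,
# so a 3-dimensional variable becomes 6-dimensional.
combined_shape = sizes + per_file_shape
print(len(paths_by_index), "files ->", len(combined_shape), "dimensions", combined_shape)
```

The nested list only fixes the ordering of the files; merging in the lower-dimensional variables would still be a separate step.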

This dataset is therefore comparable to some of the largest datasets already available in Zarr (at least in terms of the number of chunks and variables, if not on-disk size), and is very similar to the pathological case described in #104:

> 24MB per array means that even a really big store with 100 variables, each with a million chunks, still only takes up 2.4GB in memory - i.e. your xarray "virtual" dataset would be ~2.4GB to represent the entire store.
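The arithmetic behind that figure can be checked in a few lines. This is only a back-of-the-envelope sketch: it assumes decimal units (1 MB = 10^6 bytes), and the final estimate for the Atlas rests on a further assumption that is not stated in the issue, namely that each variable in each file contributes exactly one chunk reference.

```python
MB = 10**6  # assuming decimal megabytes
GB = 10**9  # assuming decimal gigabytes

# Figures quoted from #104.
bytes_per_array = 24 * MB   # in-memory size of one array's references
chunks_per_array = 10**6    # a million chunks per variable
n_variables = 100

# Implied cost of a single chunk reference, and the total for the whole store.
bytes_per_chunk_ref = bytes_per_array / chunks_per_array
total_gb = n_variables * bytes_per_array / GB
print(f"{bytes_per_chunk_ref:.0f} bytes per chunk reference, {total_gb:.1f} GB in total")

# Rough estimate for the Atlas, using the file and variable counts listed above.
# ASSUMPTION (not from the issue): one chunk per variable per file.
n_files = 500_000
vars_per_file = 40
atlas_refs = n_files * vars_per_file
atlas_gb = atlas_refs * bytes_per_chunk_ref / GB
print(f"{atlas_refs:,} chunk references, roughly {atlas_gb:.2f} GB")
```

If the real files hold more than one chunk per variable, the estimate scales up proportionally.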

If we can virtualize this we should be able to virtualize most things 💪


Getting this done requires many features to be implemented:

Additionally once zarr-python actually understands some kind of chunk manifest, I want to also go back and create an actual zarr store for this dataset. That will additionally require:

Footnotes

  1. In fact, pretty much the only ways in which this dataset could be worse are if it had differences in encoding between netCDF files, variable-length chunks, or netCDF groups, but thankfully it has none of those 😅
