-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add hypergrib as as a grib reader #238
Comments
Sounds awesome! My intention is definitely for But |
Just to flag that, after @emfdavid pointed out that some GRIB datasets contain billions of GRIB messages (chunks), I've changed the design for A kerchunk JSON manifest of a trillion chunks would be about 50 TB! Instead, the plan is that We will still store some metadata for each NWP dataset. But it'll just be the array shape, dimension names, and coordinate labels. And maybe some additional metadata for when NWPs are upgraded and, for example, more ensemble members are generated per run. I'm guessing the lack of a manifest makes The discussion that triggered this design change is here: JackKelly/hypergrib#14 (comment) The new design is sketched out here: https://github.com/JackKelly/hypergrib/blob/main/design.md |
Yikes! It's not impossible to imagine handling this with the virtualizarr stack - it just becomes territory where you need out-of-core-computation just to concatenate / save the manifests (i.e. dask/cubed). It's the same scale of problem as processing TB's of numerical array data now.
I wonder if this could be handled by having a |
If grib interleaving was predictable that would be interesting. I thought it was pretty random given what we'd seen in AMPS. Why trillions of chunks? Is each time chunk tiny? It seems like it must be a poor format choice (for a specific case) if the index must be so big. Certainly there's a lot of array data out there that should just be a table. |
I realize the xarray dataset is required to have a finite number of chunks and finite dimensions (or maybe we can work around this?) but with the ability to compute all the chunk that will ever exist for a dataset like HRRR or ERA5, what is left to concatenate? I want to be able to call xr.open and have every variable for all time and all forecast horizons available. I think we can do that without loading any data. |
Finite yes, but if you can cheaply compute the chunk references then how is finite a limitation?
You personally might not need to concatenate further for these particular datasets, but I bet you someone is immediately going to be like "I have this weird dataset with multiple sets of GRIB files that I want to form part of the same Zarr store..." 😄
I know, I'm not suggesting we load data. I'm saying that if you're going to build optimized readers and representations of grib datasets, then as you're going to want to save the references out to the same formats as virtualizarr does, and you're likely going to want to merge/concatenate/drop virtual variables at some point, you may as well see if you can fit the algorithmic inflation part into the virtualizarr framework. The alternative would be creating a 3rd API for creating, joining, and writing virtual references. |
Well, I'll admit that I'm not aware of any NWP datasets today that have trillions of GRIB messages 🙂. But there datasets today that have tens of billions of GRIB messages (see the calculations here), and I could well imagine seeing datasets with trillions of chunks in the next few years. In general, each "chunk" in GRIB is a 2-dimensional image, where the two dimensions are the horizontal spatial dimensions. For example, a single GRIB message might contain a 2D image which represents a prediction for temperature at the Earth's surface at a particular time. |
I like the sound of this!
Yup, sounds good to me! Not least as a nice way to generate For @emfdavid's use-case (which is the same as my use-case!), the plan is that |
heh - I have been that guy more often than not!
When you open the dataset you need to pick an end date, that defines the number of chunks and the dimensions of the dataset. Could we just pick something like a year in the future from when you call open? I hacked xreds xpublish to server the database backed grib manifest we build in big query. Reloading it every few minutes to get updates is a major drag. Reloading this new lazy dataset would much less onerous... but we still need to pick an end date for something we know goes into the future to varying degrees. |
Sounds good, but it's worth pointing out here how similar the problems of creating a new xarray backend to read data and writing a virtualizarr "reader" to generate references are. In both cases you look at files, create an array-like lazy metadata-only representation of the actual data and its location (either xarray's In fact, because its also useful for virtualizarr readers to also be able to actually load variables into memory (see the
Thinking about this more it's perhaps not trivial... The reason concatenation is possible in virtualizarr is because
So IIUC you have datasets that are appended to frequently? And by reloading you mean that you don't want to reindex the entire thing just to append the new chunks? Cheaply appending the virtual references for the new chunks to an existing store is the same problem as discussed in #21 (and something the Earthmover folks should have better solution for soon...) I don't see why this requires picking an arbitrary end date in the distant future that hopefully includes all the data you plan to write - instead the data schema should be updated only when you actually do write (/create virtual references to) new data. |
If you are creating your own dataset with thousands of chunks having an explicit manifest is great and working to make it easy to append/concat is a critical feature for the xarray/zarr toolset. Manifest building and appending datasets are problems that I worked really hard to solve for Camus over the past year. Last week @JackKelly turned the problem on it's head, pointing out the the manifest could be computed algorithmically and I look at the problem completely differently now. Our goal is working with enormous data sets we don't control, with millions of grib files, billions of chunks put in the cloud by NODD and ECMWF. They add 100,000 chunks a day to some of these datasets and for operational use, I want the latest data as it becomes available. Directly building native zarr datasets seems out of reach for these organizations at the moment, so we are looking for solutions to work with what we have got. I was planning to advocate for building a public database backed manifest that could be used with virtualizarr or with kerchunk as I have been doing privately at Camus for over 6 months now. Based on operational experience, this architecture is difficult and costly to manage operationally. I am sure my work leaves much to could be improved, but I am talking about datasets with variables have more than 100,000 chunks with more to append each hour as they land in the cloud bucket. I am using the dataset across dozens of distributed workers running big ML jobs. Lazily computing the manifest itself is just a very different way of thinking about these big NWP archival datasets that could be much easier. I fully support and appreciate virtualizarr and we should definitely build bridges between these solutions (materializing a manifest where needed) but I suspect working with an xarray or zarr dataset free of the materialized manifest could be a very powerful solution in it's own right. There a lots of pitfalls that will make this paradigm shift difficult working with the existing APIs so having insight from the folks that build the existing stack would be fantastic. Can we do this - how? |
One alternative to chunk-level dataset aggregation is file-level dataset aggregation. This is actually the traditional approach that has been in use for decades (see e.g. ncml). Chunk-level aggregation emerged in the context of cloud computing, where the cost of peeking at each file is high due to the latency of HTTP requests. But if you somehow know a-priori exactly where the chunks are in each file, you can avoid that. |
I'm not really sure what you mean in this context @rabernat . The client still needs to know where inside the file to read from, so we're back to chunks aren't we? If anyone wants to chat about this we do have a bi-weekly virtualizarr meeting starting in a couple of minutes! EDIT: In the unlikely event that you are trying to join today please say so - having some zoom issues. |
Can't make it this week - but I will try to join next time. |
I was trying to join and couldnt get in and just saw this. So if you happen to see this i am willking to join! |
Yes, but there are many ways you can provide that information:
|
If you knew this pattern for certain, couldn't you just make an implementation of the zarr Store API which implements Then you open that Zarr store with whatever tool you like to do the processing. A similar idea would be to have "functionally-derived virtual chunk manifests" in icechunk. |
Actually on second thoughts I think this:
is effectively just one possible implementation of the original idea of @JackKelly 's above:
|
I just learned about hypergrib (via this blog post).
I don't know much about how GRIB files are structured, but if @JackKelly makes a very performant GRIB reader that can:
ChunkManifest
objects and associated metadata (ideally usingChunkManifest.from_arrays()
to avoid any intermediate representation),then we could add it as a new reader to
virtualizarr.readers
.In fact FYI my intention now is to probably make an entrypoint system for virtualizarr, in addition to plugging in to xarray's backend entrypoint system. That would allow this syntax for creating virtual
xr.Dataset
objects containingManifestArrays
:The text was updated successfully, but these errors were encountered: