Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use 3 numpy arrays for manifest internally #107

Merged
merged 45 commits into from
Jun 18, 2024
Merged

Conversation

TomNicholas
Copy link
Member

Supercedes #39 as a way to close #33, the difference being that this uses 3 separate numpy arrays to store the path strings, byte offsets, and byte range lengths (rather than trying to put them all in one numpy array with a structured dtype). Effectively implements (2) in #104.

Relies on numpy 2.0 (which is currently only available as a release candidate).

@TomNicholas TomNicholas added enhancement New feature or request performance labels May 10, 2024
@TomNicholas TomNicholas marked this pull request as draft May 10, 2024 16:55
@martindurant
Copy link
Member

Here is a script which generated a 9GB JSON file across many years of NWM data: https://gist.github.com/rsignell-usgs/d386c85e02697c5b89d0211371e8b944 . I'll see if I can find a parquet version, but you should reckon on 10x in on-disk size (parquet or compressed numpy).

Unfortunately, the references have been deleted, because the whole dataset is now also available as zarr. I may have the chance sometime to regenerate them, if it's important.

@martindurant
Copy link
Member

martindurant commented May 10, 2024

Also, a super-simple arrow- or awkward-like string representation as contiguous numpy arrays could look something like

class String:
    def __init__(self, offsets, data) -> None:
        self.offsets = offsets
        self.data = data

    def __getitem__(self, item):
        if isinstance(item, int):
            return self.data[self.offsets[item]: self.offsets[item + 1]].decode()
        else:
            return String(self.offsets.__getitem__(item), self.data)

>>> s = String(np.array([0, 5, 10]), b"HelloWorld")
>>> s[1]
'World'

@TomNicholas
Copy link
Member Author

Here is a script which generated a 9GB JSON file across many years of NWM data: https://gist.github.com/rsignell-usgs/d386c85e02697c5b89d0211371e8b944 . I'll see if I can find a parquet version, but you should reckon on 10x in on-disk size (parquet or compressed numpy).

That's useful context for #104, thanks Martin!

Copy link
Member Author

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h5py (the latter from the scientific python nightly repo)

Can I do this by adding a special pip install command to a conda env?

@TomNicholas
Copy link
Member Author

Now builds atop #139

@keewis
Copy link
Contributor

keewis commented Jun 13, 2024

Can I do this by adding a special pip install command to a conda env?

you can:

# - netcdf4
- h5netcdf
- pip
- pip:
  - -i https://pypi.anaconda.org/scientific-python-nightly-wheels/simple
  - --pre
  - h5py
  - numpy

(and if you need multiple packages from different sources, put the definition in requirements.txt files and include them in the environment file using -r <requirements.txt>)

@martindurant
Copy link
Member

Note that en efficient packing of a string array would be all the character data concatenated, and a separate offsets array. This is what arrow or awkward does. Parquet can also store this way, but alternating length-data is standard. Dense packings like that ssusme immutability https://github.com/martindurant/fastparquet/blob/faster/fastparquet/wrappers.py#L51 has the simplest possible version of this (yes, fastparquet is intending to move to pure numpy within months, if I didn't say this already).

@TomNicholas
Copy link
Member Author

@martindurant does numpy's new string dtype not do that too?

@martindurant
Copy link
Member

No, it allocates on the heap and stores pointers, I believe. That is not (necessarily) memory contiguous, but allows for mutability.

@TomNicholas
Copy link
Member Author

This PR now works (passes all tests locally). The failures are the same as on main, tracked in #147 .

Note that en efficient packing of a string array would be all the character data concatenated, and a separate offsets array. This is what arrow or awkward does.

What about using a pyarrow string array instead:

@rabernat @martindurant a pyarrow string array might be a bit more memory-efficient, but

a) I can store millions of chunk references using just a few MB with this new numpy dtype, which seems plenty good enough to me

In [1]: from virtualizarr.tests import create_manifestarray

In [2]: marr = create_manifestarray(shape=(100, 100, 100), chunks=(1, 1, 1))

In [3]: marr
Out[3]: ManifestArray<shape=(100, 100, 100), dtype=float32, chunks=(1, 1, 1)>

In [4]: marr.manifest._paths.nbytes / 1e6
Out[4]: 16.0

In [5]: (marr.manifest._paths.nbytes + marr.manifest._offsets.nbytes + marr.manifest._lengths.nbytes) / 1e6
Out[5]: 24.0

b) IIUC pyarrow string arrays are not N-dimensional arrays, and half the point of this PR is that my implementation of concat/stack/broadcasting for ManifestArrays becomes really simple if I can just delegate those operations to a wrapped set of numpy arrays.

Comment on lines 127 to 139
@classmethod
def validate_chunks(cls, entries: Any) -> Mapping[ChunkKey, ChunkEntry]:
validate_chunk_keys(list(entries.keys()))
def from_arrays(
cls,
paths: np.ndarray[Any, np.dtype[np.dtypes.StringDType]],
offsets: np.ndarray[Any, np.dtype[np.int32]],
lengths: np.ndarray[Any, np.dtype[np.int32]],
) -> "ChunkManifest":
"""
Create manifest directly from numpy arrays containing the path and byte range information.

Useful if you want to avoid the memory overhead of creating an intermediate dictionary first,
as these 3 arrays are what will be used internally to store the references.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ayushnag @sharkinsspatial you might want to try building numpy arrays of references and passing them to this new constructor instead to keep memory usage down.

@TomNicholas TomNicholas merged commit c4d4325 into main Jun 18, 2024
4 checks passed
@TomNicholas TomNicholas deleted the numpy_arrays_manifest branch June 18, 2024 20:23
@TomNicholas TomNicholas mentioned this pull request Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

In-memory representation of chunks: array instead of a dict?
4 participants