Reader for Hugging Face's SafeTensor format #367

Open
TomNicholas opened this issue Jan 2, 2025 · 2 comments
Labels
enhancement New feature or request readers references generation Reading byte ranges from archival files

Comments

@TomNicholas
Member

TomNicholas commented Jan 2, 2025

We should be able to write a new VirtualiZarr reader for the Hugging Face SafeTensors format.

Hugging Face safetensors is an interesting example: the data is uncompressed and has no internal chunking, so reading it is basically like reading netCDF3. But it also puts all the metadata at the start of the file, which makes it a bit like cloud-optimized HDF5. See also huggingface/safetensors#527 (comment)
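
Since each tensor is stored uncompressed and contiguously, it should map onto a single chunk covering the whole array. Here is a rough sketch of that mapping, assuming header entries carry `dtype`, `shape` and `data_offsets` (relative to the end of the header) as described in the safetensors format docs. The function `tensor_to_chunk_ref`, the `ITEMSIZE` table and the layout of the returned dict are illustrative names only, not VirtualiZarr's actual API.

```python
import math

# Bytes per element for a subset of safetensors dtype codes (illustrative, not exhaustive).
ITEMSIZE = {
    "F64": 8, "F32": 4, "F16": 2, "BF16": 2,
    "I64": 8, "I32": 4, "I16": 2, "I8": 1, "U8": 1, "BOOL": 1,
}


def tensor_to_chunk_ref(path, entry, data_start):
    """Map one header entry to a single-chunk byte-range reference.

    Hypothetical helper: `data_start` is the absolute offset at which the
    tensor data region begins (8 bytes plus the header length).
    """
    begin, end = entry["data_offsets"]
    length = end - begin
    # No compression and no internal chunking: the byte length must equal
    # the number of elements times the element size.
    expected = math.prod(entry["shape"]) * ITEMSIZE[entry["dtype"]]
    if length != expected:
        raise ValueError(f"expected {expected} bytes, header says {length}")
    # Assumed Zarr-style key for the only chunk: one zero per dimension.
    key = ".".join("0" for _ in entry["shape"]) or "0"
    return {key: {"path": path, "offset": data_start + begin, "length": length}}


entry = {"dtype": "F32", "shape": [2, 3], "data_offsets": [16, 40]}
ref = tensor_to_chunk_ref("model.safetensors", entry, data_start=100)
```

The size check is there because a mismatch would mean the "one tensor, one chunk" assumption does not hold for that file.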

Originally posted by @TomNicholas in #218

cc the Pangeo ML people: @weiji14 @negin513 @maxrjones

@TomNicholas TomNicholas added enhancement New feature or request references generation Reading byte ranges from archival files readers labels Jan 2, 2025
@TomNicholas
Member Author

The format specification seems very straightforward.

If it really is that simple, then I think this reader could be implemented without even using the safetensors library, just using fsspec and parsing the bytes ourselves. Our unit tests should probably still check correctness against the safetensors library itself, though.
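
To make the "parse the bytes ourselves" idea concrete, here is a minimal sketch, assuming the layout given in the safetensors format description: an 8-byte little-endian unsigned integer holding the header length, then that many bytes of UTF-8 JSON, then the raw tensor data. The names `read_safetensors_header` and `make_example_file` are hypothetical and belong to neither VirtualiZarr nor safetensors. The reader takes any binary file-like object, so a file opened through fsspec should work in place of the `io.BytesIO` used here.

```python
import io
import json
import struct


def read_safetensors_header(f):
    """Return (header_dict, data_start) for a binary file-like object.

    Hypothetical helper, not part of any library.
    """
    # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
    (header_len,) = struct.unpack("<Q", f.read(8))
    header = json.loads(f.read(header_len).decode("utf-8"))
    # Tensor data starts right after the header; each entry's
    # "data_offsets" are relative to this position.
    return header, 8 + header_len


def make_example_file():
    """Build a tiny in-memory file holding one 2x2 float32 tensor."""
    data = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
    header = {
        "weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, len(data)]}
    }
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


f = io.BytesIO(make_example_file())
header, data_start = read_safetensors_header(f)

# Read the tensor back with a plain byte-range read.
begin, end = header["weight"]["data_offsets"]
f.seek(data_start + begin)
values = struct.unpack("<4f", f.read(end - begin))
```

Only two small reads are needed before any tensor's byte range is known, which is what makes the format attractive for generating virtual references. A real reader would also need to skip the optional `__metadata__` key when iterating over tensors.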

@TomNicholas
Member Author

There is an interesting issue on safetensors about "multi-part uploads". Apparently they are not officially supported but are nevertheless widespread. This suggests a desire for model weights to be chunked and/or version-controlled, which the virtual Zarr approach could help with.
