
Extend pyhf contrib download to allow for uncompressed targets #1111

Closed · Opened Oct 14, 2020 by matthewfeickert · 4 comments · Fixed by #1697
Labels: contrib (Targeting pyhf.contrib and not the core of pyhf), feat/enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@matthewfeickert (Member)

Description

This feature request was a suggestion of @lukasheinrich's:

I wonder whether we should have something like

$ pyhf contrib download http://pyhf.scikit-hep.org/examples/shapesysexample

or similar, as a way to get starter workspaces.

This should be easy to implement and I think mostly will just require making

with tarfile.open(
    mode="r|gz", fileobj=BytesIO(response.content)
) as archive:
    archive.extractall(output_directory)

be able to deal with non-compressed targets.
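
For context, a minimal sketch (using a hypothetical in-memory payload) of why the hard-coded "r|gz" mode fails on an uncompressed tarball:

import tarfile
from io import BytesIO

# Build an uncompressed tar archive in memory (stand-in for a real download)
buffer = BytesIO()
with tarfile.open(mode="w", fileobj=buffer) as archive:
    data = b'{"channels": []}'
    info = tarfile.TarInfo(name="workspace.json")
    info.size = len(data)
    archive.addfile(info, BytesIO(data))
buffer.seek(0)

# The current hard-coded mode assumes gzip and raises tarfile.ReadError
try:
    with tarfile.open(mode="r|gz", fileobj=buffer) as archive:
        archive.extractall(".")
except tarfile.ReadError as err:
    print(f"fails as expected: {err}")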

@matthewfeickert added the feat/enhancement, good first issue, and contrib labels on Oct 14, 2020
@kratsg (Contributor) commented Oct 15, 2020

See #1090. We can probably abstract this away a little bit.

@matthewfeickert (Member, Author)

Hm, it seems one thing that needs to be considered is that this is currently reading streams, so for

magic_headers = {
    "gzip": b"\x1f\x8b",
    "zip": b"PK",
    "JSON": re.compile(br"^\s*{"),
}

there's the question of how to take

fileobj = BytesIO(response.content)

and check the header, as there is no io.BytesIO.startswith. Maybe something with io.BytesIO.read could work.
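
One way to approximate startswith on a BytesIO is to read a prefix and seek back to the start; a minimal sketch (the sniff_file_type helper name is hypothetical):

import re
from io import BytesIO

magic_headers = {
    "gzip": b"\x1f\x8b",
    "zip": b"PK",
    "JSON": re.compile(br"^\s*{"),
}

def sniff_file_type(fileobj):
    # Read enough bytes to cover the longest header, then rewind so that
    # downstream consumers (e.g. tarfile) see the stream from the start
    prefix = fileobj.read(16)
    fileobj.seek(0)
    for file_type, header in magic_headers.items():
        if isinstance(header, bytes):
            if prefix.startswith(header):
                return file_type
        elif header.match(prefix):  # the compiled regex for the JSON case
            return file_type
    return None

print(sniff_file_type(BytesIO(b"\x1f\x8b...")))  # gzip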

@matthewfeickert (Member, Author) commented Oct 19, 2020

So, sketching things out, there's the possibility of doing something like

magic_headers = {
    "gz": b"\x1f\x8b",  # gzip
    "bz2": b"BZh",  # bzip2
}
header_len = max(len(header) for header in magic_headers.values())

with requests.get(archive_url) as response:
    if compress:
        # Write the archive itself to disk instead of extracting it
        with open(output_directory, "wb") as archive:
            archive.write(response.content)
    else:
        fileobj = BytesIO(response.content)
        file_type = None
        for _file_type, header in magic_headers.items():
            if fileobj.read(header_len).startswith(header):
                file_type = _file_type
        with tarfile.open(
            mode=f"r|{file_type}", fileobj=BytesIO(response.content)
        ) as archive:
            archive.extractall(output_directory)

but this probably isn't a step towards a solution, as any benefits of using a stream seem negated (I'm actually not sure what the advantages of a stream over a file are here for tarfile.open). Also, this implementation still assumes a pyhf pallet, and so a tarfile, rather than an arbitrary file payload as originally suggested for the starter example workspaces.
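
To also cover the arbitrary-payload case, one option is to dispatch on the sniffed type; a sketch assuming the hypothetical sniff_file_type helper from above (the workspace.json output name is also an assumption):

import tarfile
import zipfile
from io import BytesIO
from pathlib import Path

import requests

def download(archive_url, output_directory):
    response = requests.get(archive_url)
    response.raise_for_status()
    fileobj = BytesIO(response.content)

    file_type = sniff_file_type(fileobj)
    if file_type == "JSON":
        # Bare workspace payload: write it out as a single JSON file
        Path(output_directory).mkdir(parents=True, exist_ok=True)
        (Path(output_directory) / "workspace.json").write_bytes(response.content)
    elif file_type == "zip":
        with zipfile.ZipFile(fileobj) as archive:
            archive.extractall(output_directory)
    else:
        # Tarball, compressed or not: let tarfile detect the compression
        with tarfile.open(mode="r:*", fileobj=fileobj) as archive:
            archive.extractall(output_directory)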

@jpivarski (Member)

Seeking is important: if you're going to fileobj.read(header_len) for each of the magic_headers.items(), you'll have to seek back to 0 between each one because reading moves the file pointer. But presumably, you chose header_len to be the maximum length of all the possible headers so that you could read it once and compare each possibility against what you've read.

But then again, adding round-trip times to read two bytes is bad for latency if you're accessing something remote, like HTTP through requests. In that case, it would probably make more sense to subclass BytesIO to make it read a healthy-sized batch on first access, check the two-byte header as a side-effect of that first read, and then let the user of your BytesIO subclass (Python's tarfile module) drive it from then onward (i.e. let tarfile decide how large of a read to make, once its read requests exhaust your initial chunk).

But then again, the tarfile.open documentation says that mode='r' or (equivalently) mode='r:*' reads "with transparent compression." I interpret that as "it does what I've described above—detects the compression type of the file by reading its magic headers and then decompresses appropriately." A quick experiment will confirm or refute that interpretation. It wouldn't be surprising if the tarfile module already does this for you, since compressed tarfiles are common, and most people will want this (getting back to the original motivation from Slack). It might have been a hidden feature that you've already been inheriting from your use of this standard library module.
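
A quick experiment along those lines (a sketch with hypothetical in-memory archives) supports that reading:

import tarfile
from io import BytesIO

def make_tar(compress):
    # Build a tar archive in memory, optionally gzip-compressed
    buffer = BytesIO()
    with tarfile.open(mode="w:gz" if compress else "w", fileobj=buffer) as archive:
        data = b'{"channels": []}'
        info = tarfile.TarInfo(name="workspace.json")
        info.size = len(data)
        archive.addfile(info, BytesIO(data))
    buffer.seek(0)
    return buffer

for compress in (False, True):
    # mode="r:*" (the default for mode="r") detects the compression itself
    with tarfile.open(mode="r:*", fileobj=make_tar(compress)) as archive:
        print(compress, archive.getnames())  # both print ['workspace.json']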
