
Extend pyhf contrib download to allow for uncompressed targets #1111

Closed · Opened Oct 14, 2020 by matthewfeickert · 4 comments · Fixed by #1697
Labels: contrib (Targeting pyhf.contrib and not the core of pyhf), feat/enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@matthewfeickert (Member)

Description

This feature request was a suggestion of @lukasheinrich's:

I wonder whether we should have something like

$ pyhf contrib download http://pyhf.scikit-hep.org/examples/shapesysexample

or similar, as a way to get starter workspaces.

This should be easy to implement and I think mostly will just require making

with tarfile.open(
    mode="r|gz", fileobj=BytesIO(response.content)
) as archive:
    archive.extractall(output_directory)

be able to deal with non-compressed targets.
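
For context, a minimal sketch (using a hypothetical in-memory payload) of why the hard-coded "r|gz" mode fails on an uncompressed tarball:

import tarfile
from io import BytesIO

# Build an uncompressed tar archive in memory (stand-in for a real download)
buffer = BytesIO()
with tarfile.open(mode="w", fileobj=buffer) as archive:
    data = b'{"channels": []}'
    info = tarfile.TarInfo(name="workspace.json")
    info.size = len(data)
    archive.addfile(info, BytesIO(data))
buffer.seek(0)

# The current hard-coded mode assumes gzip and raises tarfile.ReadError
try:
    with tarfile.open(mode="r|gz", fileobj=buffer) as archive:
        archive.extractall(".")
except tarfile.ReadError as err:
    print(f"fails as expected: {err}")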

@matthewfeickert added the feat/enhancement, good first issue, and contrib labels on Oct 14, 2020
@kratsg (Contributor) commented Oct 15, 2020

See #1090. We can probably abstract this away a little bit.

@matthewfeickert (Member, Author)

Hm, it seems one thing that needs to be considered is that this is currently reading streams, so for

magic_headers = {
    "gzip": b"\x1f\x8b",
    "zip": b"PK",
    "JSON": re.compile(br"^\s*{"),
}

there's the question of how to take

fileobj = BytesIO(response.content)

and check the header, as there is no io.BytesIO.startswith. Maybe something with io.BytesIO.read could work.
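
One way to approximate startswith on a BytesIO is to read a prefix and seek back to the start; a minimal sketch (the sniff_file_type helper name is hypothetical):

import re
from io import BytesIO

magic_headers = {
    "gzip": b"\x1f\x8b",
    "zip": b"PK",
    "JSON": re.compile(br"^\s*{"),
}

def sniff_file_type(fileobj):
    # Read enough bytes to cover the longest header, then rewind so that
    # downstream consumers (e.g. tarfile) see the stream from the start
    prefix = fileobj.read(16)
    fileobj.seek(0)
    for file_type, header in magic_headers.items():
        if isinstance(header, bytes):
            if prefix.startswith(header):
                return file_type
        elif header.match(prefix):  # the compiled regex for the JSON case
            return file_type
    return None

print(sniff_file_type(BytesIO(b"\x1f\x8b...")))  # gzip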

@matthewfeickert (Member, Author) commented Oct 19, 2020

So, sketching things out, there's the possibility of doing something like

magic_headers = {
    "gz": b"\x1f\x8b",  # gzip
    "bz2": b"BZh",  # bzip2
}
header_len = max(len(header) for header in magic_headers.values())

with requests.get(archive_url) as response:
    if compress:
        # Write the archive itself to disk instead of extracting it
        with open(output_directory, "wb") as archive:
            archive.write(response.content)
    else:
        fileobj = BytesIO(response.content)
        file_type = None
        for _file_type, header in magic_headers.items():
            if fileobj.read(header_len).startswith(header):
                file_type = _file_type
        with tarfile.open(
            mode=f"r|{file_type}", fileobj=BytesIO(response.content)
        ) as archive:
            archive.extractall(output_directory)

but this probably isn't a step towards a solution, as any benefits of using a stream seem negated (I'm actually not sure what the advantages of a stream over a file are here for tarfile.open). Also, this implementation still assumes a pyhf pallet, and so a tarfile, rather than an arbitrary file payload as originally suggested for the starter example workspaces.
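
To also cover the arbitrary-payload case, one option is to dispatch on the sniffed type; a sketch assuming the hypothetical sniff_file_type helper from above (the workspace.json output name is also an assumption):

import tarfile
import zipfile
from io import BytesIO
from pathlib import Path

import requests

def download(archive_url, output_directory):
    response = requests.get(archive_url)
    response.raise_for_status()
    fileobj = BytesIO(response.content)

    file_type = sniff_file_type(fileobj)
    if file_type == "JSON":
        # Bare workspace payload: write it out as a single JSON file
        Path(output_directory).mkdir(parents=True, exist_ok=True)
        (Path(output_directory) / "workspace.json").write_bytes(response.content)
    elif file_type == "zip":
        with zipfile.ZipFile(fileobj) as archive:
            archive.extractall(output_directory)
    else:
        # Tarball, compressed or not: let tarfile detect the compression
        with tarfile.open(mode="r:*", fileobj=fileobj) as archive:
            archive.extractall(output_directory)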

@jpivarski (Member)

Seeking is important: if you're going to fileobj.read(header_len) for each of the magic_headers.items(), you'll have to seek back to 0 between each one because reading moves the file pointer. But presumably, you chose header_len to be the maximum length of all the possible headers so that you could read it once and compare each possibility against what you've read.

But then again, adding round-trip times to read two bytes is bad for latency if you're accessing something remote, like HTTP through requests. In that case, it would probably make more sense to subclass BytesIO to make it read a healthy-sized batch on first access, check the two-byte header as a side-effect of that first read, and then let the user of your BytesIO subclass (Python's tarfile module) drive it from then onward (i.e. let tarfile decide how large of a read to make, once its read requests exhaust your initial chunk).

But then again, the tarfile.open documentation says that mode='r' or (equivalently) mode='r:*' reads "with transparent compression." I interpret that as "it does what I've described above—detects the compression type of the file by reading its magic headers and then decompresses appropriately." A quick experiment will confirm or refute that interpretation. It wouldn't be surprising if the tarfile module already does this for you, since compressed tarfiles are common, and most people will want this (getting back to the original motivation from Slack). It might have been a hidden feature that you've already been inheriting from your use of this standard library module.
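
A quick experiment along those lines (a sketch with hypothetical in-memory archives) supports that reading:

import tarfile
from io import BytesIO

def make_tar(compress):
    # Build a tar archive in memory, optionally gzip-compressed
    buffer = BytesIO()
    with tarfile.open(mode="w:gz" if compress else "w", fileobj=buffer) as archive:
        data = b'{"channels": []}'
        info = tarfile.TarInfo(name="workspace.json")
        info.size = len(data)
        archive.addfile(info, BytesIO(data))
    buffer.seek(0)
    return buffer

for compress in (False, True):
    # mode="r:*" (the default for mode="r") detects the compression itself
    with tarfile.open(mode="r:*", fileobj=make_tar(compress)) as archive:
        print(compress, archive.getnames())  # both print ['workspace.json']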
