Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New HTTP cache with lower memory usage #11143

Merged
merged 23 commits into from
Sep 26, 2023

Conversation

itamarst
Copy link
Contributor

@itamarst itamarst commented May 24, 2022

Fixes #2984

Additionally, updates CacheControl to the latest release.

TL;DR: Huge decrease in memory usage, by using streaming cache instead of loading (multiple copies of) the whole file into memory in one go.

Note to reviewer: The Windows CI didn't run for reasons that I can't control. Please rerun, and if tests fail I will fix them.

Open questions

What to do about the old cache? My personal feeling is leave it alone, and then follow-up in a year or so with code to delete old caches. Happy to implement some cleanup code if you disagree.

Did I catch all places in pip that need updating? The original code is from a year ago so there might be something changed since then.

Memory usage improvements

I did a before/after memory usage comparison of installing tensorflow, which is a 500MB wheel for my OS/Python version. The measurement was done with the following script:

#!/bin/bash
set -euo pipefail
shopt -s expand_aliases
alias maxmemory='/usr/bin/time -f "%M max resident memory KB"'

rm -rf benchmarking-venv
python3 -m venv benchmarking-venv
export PIP_CACHE_DIR=benchmarking-venv/cache/
. benchmarking-venv/bin/activate

# Step 1: cold install, downloading
pip install --quiet --upgrade "$1"
echo -n "Cold cache: "
maxmemory pip install --quiet tensorflow

# Step 2: warm install, from local cache
pip uninstall --quiet -y tensorflow
echo -n "Warm cache: "
maxmemory pip install --quiet tensorflow

I ran it twice, once with latest released version of pip (./measure-memory.sh pip) and once with my branch (./measure-memory.sh $PWD). Here are the results, measuring peak memory usage:

$ ./memory-measure.sh pip==23.1.2
Cold cache: 1768380 max resident memory KB
Warm cache: 2912560 max resident memory KB
$ ./memory-measure.sh .   # i.e. install my branch
Cold cache: 624140 max resident memory KB
Warm cache: 99988 max resident memory KB

As you can see, memory usage is significantly improved!

@itamarst
Copy link
Contributor Author

itamarst commented May 24, 2022

Summary of failing checks:

  • The vendoring is failure is due to patching with workaround for CacheControl issue I mentioned above.
  • The Windows / Python 3.7 failure is due to this PR. Update: Should be fixed now, hopefully.

@itamarst
Copy link
Contributor Author

@pfmoore would appreciate any feedback.

@pfmoore
Copy link
Member

pfmoore commented May 24, 2022

Not sure what you want from me? I don't feel I have anything to say regarding the technical details of the implementation. As far as the design goes, my position remains the same. I don't like pip leaving a load of unused cache data that it will no longer use around. This PR seems to simply offer a means of listing what's present via pip cache list, but there's no way of clearing out old data, nor any way users will find out that they need to (an obscure command like pip cache list isn't very discoverable).

The savings with the new cache are excellent, and maybe that justifies a sucky UI, and hidden bloat for anyone who isn't already sufficiently affected by the cache size to be actively looking for a way to clean stuff up.

This whole thing feels like we're designing for the rare/suboptimal case (people using both old and new versions of pip on the same machine) where we should be optimising for the normal/recommended case (people always use the latest version of pip). As such, use the new style cache and simply trash an old style cache if found is the indicated approach. Just possibly with a temporary means to opt into not deleting the old-style cache (a --keep-old-style-cache option), but this is of limited use because forget to use it once and your old cache is gone.

I'd be very interested to know how other CacheControl users have handled this sort of transition. I can't believe we're the only people in this situation. I wish CacheControl had better support for this type of transition, but as they don't seem to, I'd like to know how their other users feel about the situation.

At the moment, I'm not clear on the way forward here. We still seem no closer to having an acceptable transition strategy. But maybe I'm the only one with reservations about just leaving the old cache around. If so, one of the other maintainers can override me and merge this - I won't object to that, my concerns aren't severe enough that I want to block someone else from accepting this if they feel I'm making too much of the issue.

@itamarst
Copy link
Contributor Author

I am happy to clear out the old cache if that's what you prefer, I don't have strong opinions here. And I don't know how pip technical decisions work, or who makes decisions here, so I'm not sure how to calibrate here... Should I update the PR with a "clean out old cache on startup of new pip"?

@pfmoore
Copy link
Member

pfmoore commented May 25, 2022

And I don't know how pip technical decisions work, or who makes decisions here

Mostly by rough consensus, so I think the best thing to do is to wait for some of the other @pypa/pip-committers to weigh in with their opinions.

@itamarst
Copy link
Contributor Author

Note btw that this may also speed up installation slightly, since it doesn't involve decoding a 500MB messagepack message to extract the body.

@github-actions github-actions bot added the needs rebase or merge PR has conflicts with current master label May 30, 2022
@itamarst itamarst force-pushed the 2984-new-cache-lower-memory branch from 0d76330 to e8654ad Compare June 2, 2023 17:53
@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label Jun 2, 2023
@itamarst
Copy link
Contributor Author

itamarst commented Jun 2, 2023

I am back to working on this now that cachecontrol is under PyPA management and has had a release.

@itamarst itamarst marked this pull request as ready for review June 2, 2023 19:25
@itamarst
Copy link
Contributor Author

itamarst commented Jun 8, 2023

Any chance for a review?

@pfmoore
Copy link
Member

pfmoore commented Jun 8, 2023

I am happy to clear out the old cache if that's what you prefer, I don't have strong opinions here.

Did you do this, or is the code still implementing the same strategy as it was when I previously commented? I don't think I have anything new to add beyond what I said above.

@itamarst
Copy link
Contributor Author

itamarst commented Jun 8, 2023

It's still the same strategy. The main difference is that this is now at least theoretically mergeable, insofar as it's vendoring a released version of cachecontrol instead of a patched one.

@pfmoore
Copy link
Member

pfmoore commented Jun 8, 2023

Cool. I still think we should build this to optimise for people who are upgrading to the latest pip, as I said above (and hence switch to the new cache format and clear out the old data, rather than leaving it there "just in case") but that's just my view. I'm fine if you'd rather leave the code as is and wait for other opinions.

@dstufft
Copy link
Member

dstufft commented Jun 9, 2023

I do have a few concerns with wiping out the cache:

  • I suspect it might be reasonably common to use multiple versions of pip since each virtual environment gets it's own pip, and this can result in a lot of "cache thrash" as different versions are writing and deleting to that path.
  • The deletion for a large cache may end up being kind of slow, which might be fine for a one time cost but if the cache thrashing happens it might happen over and over again.
  • The deletion code is going to be checking for the old cache ~forever on every pip invocation.

I wonder if it might not be better to sort of phase it in? Basically do:

  1. Release a version of pip that starts writing to the new cache, but leaves the old cache alone.
  2. Wait some amount of time (is the dep policy still 2 releases? that sounds good to me).
  3. Release a version of pip that starts deleting the old cache.
  4. Wait some amount of time again (2 releases?)
  5. Release a version of pip that stops deleting the old cache.

I dunno, take my thoughts with a big grain of salt, just what popped into my head.

@astrojuanlu
Copy link
Contributor

Alternatively, put this behind a flag? --low-ram-cache writes to the new cache, then in the next release not using it emits a warning, then in the next release it becomes the default but it can be disabled, then in the next release the disable flag gets deprecated and emits a warning, then in the next release the disable flag gets removed and the other one becomes a no-op.

@pradyunsg
Copy link
Member

Let's leave the cache in place for now. I think we should wipe it in a later release, but not in the first cut for sure.

@pradyunsg
Copy link
Member

pradyunsg commented Jun 9, 2023

Alternatively, put this behind a flag?

I don't think this needs an opt-in/opt-out cycle, to be honest. It's unclear to me what that would achieve.

src/pip/_vendor/vendor.txt Outdated Show resolved Hide resolved
@pfmoore
Copy link
Member

pfmoore commented Jun 9, 2023

Let's leave the cache in place for now. I think we should wipe it in a later release, but not in the first cut for sure.

OK. How will we ensure we don't forget to do that? Maybe create a follow-up PR now that adds the "delete old cache data" functionality, and add it to a later release milestone?

Would it be worth adding some docs in https://pip.pypa.io/en/stable/topics/caching/ covering how to manually clear the old cache data if you know you don't need it? The "Cache management" section is basically all about managing the wheel cache, and doesn't help anyone who wants to clear the HTTP cache. Even if it's just "to clear the HTTP cache, delete the contents of such-and-such directory, there's no fine control" along with a note explaining that pip switched to a new cache format in version X.Y and the old cache data will no longer be used after that.

This is very much from my personal perspective, as someone who hates wasting disk space on "obsolete" cached data (there's a bunch of directories on my PC linked to apps I no longer have installed, and that annoys the heck out of me 🙁). I'd rather pip not be one of the tools that annoys me like that...

@itamarst
Copy link
Contributor Author

itamarst commented Jun 9, 2023

OK, so summary of tasks so far—

In current PR:

  • Documentation of some sort
  • Update to latest CacheControl

@sbidoul
Copy link
Member

sbidoul commented Jun 9, 2023

I personally don't think we should wipe the old cache without user consent.

I frequently see users with a large number of virtual environments with various python and pip versions. Wiping the cache would make thing painful for them, while really benefiting only few users who would have very large caches and can afford to use the same (or modern) pip versions everywhere.

So I'd be in favor of documenting. We could also log a warning when creating the new cache while an old cache is present, to inform users and give them information so they can decide whether to clean or not.

@itamarst
Copy link
Contributor Author

itamarst commented Jun 9, 2023

A warning as an alternative seems like an option for "in a couple of releases", insofar as the gap is intended to decrease chances of people using multiple versions of pip with different cache types. Put another way, a warning "In v23.6 and later there is a new cache format, you may wish to delete the old cache if you never use older versions" is not something you want to do in v23.6 because from personal experience people don't upgrade all random venvs immediately.

I have filed #12082 for the "what do we do with old cache" issue. In addition to the options discussed here (delete automatically, warn with instructions) I added a third option. Further discussion of the cleanup issue should probably happen there insofar as a cleanup mechanism is not deemed immediately necessary in this PR.

@itamarst itamarst force-pushed the 2984-new-cache-lower-memory branch from 45c69e2 to a86d7d6 Compare June 12, 2023 13:32
@itamarst
Copy link
Contributor Author

I have addressed the immediate comments, and there is an issue filed (#12082) for whatever the clean-up policy is going to be in a future release.

@itamarst
Copy link
Contributor Author

itamarst commented Sep 6, 2023

Beyond the two comments above, anyone have other comments and able to do a formal sign off? Thank you!

Copy link
Member

@dstufft dstufft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, appears to be a solid improvement over how the cache currently works and the migration path looks good.

Since I haven't been super active in pip lately I'm not going to merge this myself out of respect for the folks who have been, but for whatever my review is worth, I think this is good to go.

@pfmoore
Copy link
Member

pfmoore commented Sep 6, 2023

I'm willing to merge on the basis of @dstufft's review, as he knows far more than me about HTTP caching. And frankly, this has been waiting around for ages now - your patience is appreciated @itamarst! Assuming the CI passes, and no-one else objects, I'll merge later this evening.

@itamarst
Copy link
Contributor Author

itamarst commented Sep 6, 2023

Thank you! There's one unaddressed comment above re output, I can try to do it today if I have time, otherwise tomorrow.

@pfmoore
Copy link
Member

pfmoore commented Sep 6, 2023

Cool - I'd misread "maybe we can merge" as meaning it was OK as it is, rather than "we can update the doc to merge these cases" 🙂

Ping me once you've done the update and I'll merge the PR then.

@itamarst
Copy link
Contributor Author

itamarst commented Sep 6, 2023

OK, did a tweak, so if tests pass that should address that comment.

@itamarst
Copy link
Contributor Author

No idea why this is failing. Doesn't seem related to the PR?

@uranusjr
Copy link
Member

Unrelated, see discussion in #12267.

@pfmoore
Copy link
Member

pfmoore commented Sep 25, 2023

Added to the 23.3 milestone, so that it doesn't get forgotten. If we decide to defer it, let's make sure that at least it's a conscious decision (personally, I'm fine for it to go in).

@itamarst could you merge the latest main into this PR and fix up any remaining CI issues?

@pfmoore pfmoore added this to the 23.3 milestone Sep 25, 2023
@pfmoore
Copy link
Member

pfmoore commented Sep 26, 2023

OK, merging as per this comment

@pfmoore pfmoore merged commit 2ba5acc into pypa:main Sep 26, 2023
@itamarst
Copy link
Contributor Author

Thank you!

@pfmoore
Copy link
Member

pfmoore commented Sep 26, 2023

Your patience and persistence in getting this PR incorporated is highly appreciated! Sorry it was such a long process.

@RonnyPfannschmidt
Copy link
Contributor

Absolutely awesome, I'm very happy this landed, thanks 🙏

@itamarst
Copy link
Contributor Author

Mostly I just hope I didn't break everything 😁

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Excessive memory use when caching large packages