-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Download crates in parallel with HTTP/2 #6005
Conversation
r? @matklad (rust_highfive has picked a reviewer for you, use r? to override) |
r? @ehuss |
Hm, so I've wondered about this in the past -- I sort of intuitively always thought that parallel downloads from the "same" server wouldn't really help much because in most cases the bandwidth limitation is somewhere in the middle (likely closer to the client than the server, in fact, I'd imagine). Not that I don't think this is a good idea to refactor in general if only to have downloads and builds potentially running in parallel or parallel downloads from different registries. Mostly just interested if there's perhaps some resource anyone knows about that would help clarify this for me :) |
The claims in #5161 (comment) and #467 provide what I think are some pretty strong data for pursuing something like this. I think it's fine to hold off on landing this until there's at least a proof-of-concept to test out as well. |
Great work. It was what I was stuck (or maybe just upset) with and this makes implementing the actual part much easier. Thanks! @Mark-Simulacrum, the whole point of this is to parallelize the initial round trips. The crates.io server in US takes non trivial time to reach from Japan, and thus the time is wasted mostly waiting for a trivial redirect. If we someday absorb the redirect at CDN layer, that could be an improvement as well. |
So to add to this dataset, I've got a proof-of-concept working that uses http/2 with libcurl to do downloads in Cargo itself. On my machine in the Mozilla office (connected to a presumably very fast network) I removed my |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nitpicks.
Overall, this PR introduce a rather complicated abstraction which is probably caused by the way we resolve dependencies currently.
The approach is more like uncontrolled concurrency (like goroutines, or whatever you call), which IMO should be replaced by explicit logic boundaries. Do you know what information is missing from index and requires crate fetch instead? I would like to be aware of them if possible. If you don't know, no worry; this approach will work for the near future, and we can tackle with this when we change the way index works.
/// Completes at least one downloading, maybe waiting for more to complete. | ||
/// | ||
/// This function will block the current thread waiting for at least one | ||
/// create to finish downloading. The function may continue to download more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo?
/// crates if it looks like there's a long enough queue of crates to keep | ||
/// downloading. When only a handful of packages remain this function | ||
/// returns, and it's hoped that by returning we'll be able to push more | ||
/// packages to download into the queu. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo?
@ishitatsuyuki yeah it's not a great abstraction because it's not as simple as "download everything in this list". I initially didn't want to disturb the existing code too much because While the unit dependencies may be complicated though it may be the case that package dependencies are much simpler, so if desired we can try to do that instead and always use an interface that looks like "download this list of packages". I'm not sure what you mean about missing index information though? The index is all Cargo looks at before it starts downloading crates |
c9421a9
to
eed35ed
Compare
I had to take the new code for a test drive. On my (slow) network, downloading all the crates for a fresh cargo build goes from 48s to 22s!! |
If you want some early feedback: The progress indicator only works at the resolution of entire packages. If you have only 1 package to download, and it takes a long time to download, cargo prints nothing until after the download is complete. It would be nice if it could display some progress based on the package size. |
@ehuss good to hear! That's not quite as large of an improvement as I would have hoped for but still is nice :) For progress indicators I'm not really sure what to do myself. Previously we'd download one crate at a time which meant it was relatively obvious how to indicate progress, but now we don't have a great way to indicate progress I think because it's a whole soup of crates, and we don't know how big they all are. I initially tried to make a progress bar of "total bytes downloaded" vs "total bytes needed to download", but it was very jumpy and non-continuous, so I wasn't really sure what was going on there. I don't think though we'll be able to land this with a package resolution, as you've seen with larger crates it's just not enough information compared to what we have today. |
It looks like we have to download some tarballs in order to determine unit dependencies currently? (Is this correct?) If so, what particular information that is not in the index does it need? It seems having the list of unit dependencies ahead of time is also good for displaying better progress percentages. We don't know the file size anyway, but we can address it some day by changing the index format. |
Oh so the index gives us the "maximal" crate graph sort of in terms of independent of activated features and which target platform. The In that sense what we're doing, creating a minimal resolution crate graph, can require more information than what's in the index. The index doesn't contain information about things like the crate's library targets, for example. I don't think we need to change the index for this, though, we can probably just get smarter in Cargo about constructing the minimal crate graph conservatively. |
eed35ed
to
14faed2
Compare
Looking into things a bit, Heroku doesn't support HTTP/2 (ref) which means that we won't get multiplexing when talking to crates.io. Our CDN with static.crates.io, cloudfront, does support HTTP/2 though. |
@alexcrichton In my former PR I used pipelining which reliably works for Heroku's reverse proxy. It will achieve similar efficiency improvements without creating too much TCP connections. |
c7cb406
to
23b7a29
Compare
26374a9
to
955ae8d
Compare
Ok! I've done some various tinkering locally and I think this is at a good spot. This now works on all platforms, I've tested on all platforms, and I'm relatively happy with the new output progress. The output looks like this: Fast CPU with fast internet Not as fast computer with throttled LTE-like internet Some notable points of interest
Definitely some room for improvement, but I think we're not necessarily hit by any showstoppers! With the latest commit this should also be good to go in terms of landing, I don't know of any further blockers. |
Does this need testing when behind a proxy? The status is still a little jumpy, but better than before. And I definitely don't think it should hold things up, but maybe we can tweak it in the future. I like the new final summary. |
Ah right yeah, I should test behind some non-http-2 proxies! I'll do that today. I also plan on filing a few follow-up issues after this lands to cover the remaining work items of inflating in parallel as well as computing the set of crates to download up-front. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the typos pointed out by ishitatsuyuki still need to be fixed as well.
src/cargo/core/package.rs
Outdated
// We can continue to iterate on this, but this is hopefully good enough | ||
// for now. | ||
if !self.progress.borrow().as_ref().unwrap().is_enabled() { | ||
self.set.config.shell().status("Downloading", &dl.descriptor)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about saying "Downloaded" instead?
Also, just as a general issue, in this mode nothing is displayed until the first crate is finished, which might be a long time. I'm wondering how common this scenario would be. At least for me, my editor integration does not use ttys, so I only get this mode. Would it be possible to print some generic "Downloading crates..." message at the start when progress is disabled?
src/cargo/ops/cargo_doc.rs
Outdated
packages.get(pkgid) | ||
}) | ||
.collect::<CargoResult<Vec<_>>>()?; | ||
let mut pkgs = Vec::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these 10 lines be shared with the duplicate in compile_ws
?
It seems to fix some issues perhaps!
Also at the recommentation of curl folks I've disabled HTTP1 pipelining, which may also fix the bug you were seeing @ehuss? |
I can confirm that the pipelining change fixes the problems I was seeing. With the pipelining on, I tried a newer version of squid but got the same problems. It was version 4.2, and a very basic conf file:
|
Ok! In that case I think we'll leave that turned off but leave HTTP/2 enabled even when proxies are enabled. Mind taking a final look at the code for an r+? |
@bors r+ What do you think about asking the broader community to help test this after it has been on nightly for a little bit? I'm a little concerned about people with weird networks/proxies possibly having issues. |
📌 Commit 9da2e70 has been approved by |
Certainly! Do you think that we should perhaps move the multiplexing (which I think enables HTTP/2?) behind a config option? That way this would be off by default and the call to test could have instructions about how to enable. Hm... actually I'm gonna do that. @bors: r- |
Let's hopefully weed out any configuration issues before turning it on by default!
@bors: r=ehuss |
📌 Commit f4dcb5c has been approved by |
Download crates in parallel with HTTP/2 This PR revives some of the work of #5161 by refactoring Cargo to make it much easier to add parallel downloads, and then it does so with the `curl` crate's new `http2` feature to compile `nghttp2` has a backend. The primary refactoring done here is to remove the concept of "download this one package" deep within a `Source`. Instead a `Source` still has a `download` method but it's considered to be largely non-blocking. If a crate needs to be downloaded it immediately returns information as to such. The `PackageSet` abstraction is now a central location for all parallel downloads, and all users of it have been refactored to be amenable to parallel downloads, when added. Many more details are in the commits...
☀️ Test successful - status-appveyor, status-travis |
This commit brings back the old one-line-per-crate output that Cargo's had since the beginning of time but was removed in rust-lang#6005. This was requested on our [users forum][1] [1]: https://internals.rust-lang.org/t/testing-cargos-parallel-downloads/8466/2
Regressed in rust-lang#6005 it looks like the build plan requires all packages to be downloaded rather than just those coming out of `unit_dependenices`, so let's make sure to download everything! Closes rust-lang#6082
Print a line per downloaded crate This commit brings back the old one-line-per-crate output that Cargo's had since the beginning of time but was removed in #6005. This was requested on our [users forum][1] [1]: https://internals.rust-lang.org/t/testing-cargos-parallel-downloads/8466/2
This PR revives some of the work of #5161 by refactoring Cargo to make it much easier to add parallel downloads, and then it does so with the
curl
crate's newhttp2
feature to compilenghttp2
has a backend.The primary refactoring done here is to remove the concept of "download this one package" deep within a
Source
. Instead aSource
still has adownload
method but it's considered to be largely non-blocking. If a crate needs to be downloaded it immediately returns information as to such. ThePackageSet
abstraction is now a central location for all parallel downloads, and all users of it have been refactored to be amenable to parallel downloads, when added.Many more details are in the commits...