runtime/debug: GOMEMLIMIT prolonged high GC CPU utilization before container OOM #58106
Thanks for filing a new issue with all these details, and sorry for the delay; I was out sick last week. Unfortunately, I don't have any solutions off the top of my head for this slow-memory-leak case with the memory limit. In some sense, making the memory limit softer is only going to be a patch on the issue. Yes, the performance will be better when you're close to the memory limit, but there's still going to be some kind of degradation. The memory limit seems fundamentally a bad fit because what you want here is to OOM (which the memory limit is actively fighting against); you just also want to take advantage of the fact that you have all this extra memory before then. I think what might fit your criteria is a memory minimum (or what I've called a memory "target") (#44309). For instance, you set a high memory minimum together with a low GOGC, and the GC stays lazy until memory use reaches that minimum. CC @dr2chase |
If OOM killing is acceptable behavior, would it be reasonable to implement that yourself? https://go.dev/play/p/aUYguuR-atF |
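(The playground link above isn't reproduced here. A minimal sketch of the general idea, a watchdog goroutine that samples the runtime's own memory accounting and exits once it crosses a threshold, might look like this; the metric choice, threshold, and poll interval are illustrative assumptions rather than the linked code.)

```go
package main

import (
	"log"
	"os"
	"runtime/metrics"
	"time"
)

// dieAboveLimit periodically samples the quantity the memory limit is
// maintained against (total mapped memory minus memory released to the OS)
// and exits the process once it exceeds maxBytes.
func dieAboveLimit(maxBytes uint64, every time.Duration) {
	samples := []metrics.Sample{
		{Name: "/memory/classes/total:bytes"},
		{Name: "/memory/classes/heap/released:bytes"},
	}
	for range time.Tick(every) {
		metrics.Read(samples)
		used := samples[0].Value.Uint64() - samples[1].Value.Uint64()
		if used > maxBytes {
			log.Printf("memory use %d exceeds %d bytes; exiting", used, maxBytes)
			os.Exit(2) // let the orchestrator restart the process instead of thrashing in GC
		}
	}
}

func main() {
	go dieAboveLimit(3<<30, 10*time.Second) // e.g. die above 3 GiB
	select {}                               // stand-in for the real application
}
```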
I was running some more tests; adding more screenshots - they mostly show the same behavior. In one, GC uses 2x more CPU than the actual work (66% of the container). Another looks even more dramatic: GC appears to be using 10x more CPU than the actual "work". @zephyrtronium - that's "kinda" what we actually do in our pre-GOMEMLIMIT setup. We watch the runtime GC stats and update GOGC to slow down (and eventually OOM). This gives us a bit more time before OOM, but mostly:
@mknyszek: no problem about the delay; thanks for responding. (I'll be offline soon for ~10 days too.)
Oh, this was not my understanding after reading the docs. Even the sliders in https://go.dev/doc/gc-guide suggest that setting GOGC=10 affects GC frequency long before arriving at GOMEMLIMIT (i.e. it doesn't matter if my GOMEMLIMIT is 1 GB or 20 GB if my "live heap" usage is low - GC will run frequently). Am I misunderstanding? I'll run the same tests with GOGC=10 to observe the behavior.
Agreed. However, we cannot know which applications are leaking upfront, and any application might start leaking at any time. So it makes GOMEMLIMIT a "fundamentally bad fit" for any of our applications, which is really sad.
Yeah, the degradation is expected. As long as it's bounded, it's totally fine. In the cases we see above, the GC uses up to 80% of the available cores - and 10x more time than the application "itself". This seems divergent from the proposal, right? Understood that there are no easy/quick fixes here. My highest hope here was that maybe in Go 1.21/1.22 you'd magically come up with a fix that would limit the GC more (any of <25% of GOMAXPROCS, <100% of "real work", <2 cores, $else). |
Sorry, I think I was unclear: this functionality doesn't exist. It was proposed and rejected in #44309 due to lack of interest (and just knobs having a very high bar in general, which I don't think this really met for the team). (Implementing it is relatively easy, if you wanted to try it out anyway. It's not exactly a memory target so much as a heap target when done this way, but the basic idea is to set heapMinimum: https://cs.opensource.google/go/go/+/master:src/runtime/mgcpacer.go;l=111?q=heapMinimum&sq=&ss=go%2Fgo.) |
Oh, no, that's on me - I misread the titles and confused myself. #44309 was an alternative to #48409. So we'd effectively have had a natively supported ballast. The fast(er) OOM would indeed be nice. Before we delve into adding more knobs, though - what we're seeing with GOMEMLIMIT right now is not expected, right? GC using 3-10x more CPU than the application itself is not expected, both per the proposal and the gc-guide? I'm a fan of the "fewer knobs == better" design, so if you were amenable to making the soft limit slightly softer, that would be my preference. Both the proposal and the gc-guide seem to suggest that the "50% CPU max for GC" was a starting point (?). |
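(For readers who haven't seen the ballast trick mentioned above: the idea is to keep a large, never-touched allocation reachable so the GOGC-based heap goal starts from a high baseline, spacing GC cycles out without any memory limit. A minimal sketch follows; the 2 GiB size is an arbitrary example, not something from this thread.)

```go
package main

import "runtime"

func main() {
	// Ballast: a large allocation that stays reachable but is never written to.
	// With GOGC=100 the next GC triggers at roughly twice the live heap, so a
	// 2 GiB ballast pushes the trigger far above the real working set. Because
	// the bytes are never touched, most of them remain untouched virtual pages.
	ballast := make([]byte, 2<<30)

	runApp()

	runtime.KeepAlive(ballast) // keep the ballast live for the program's lifetime
}

// runApp stands in for the real application.
func runApp() { select {} }
```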
Ah, no. It wasn't meant to be an alternative, rather complementary. Whereas the current memory limit is a maximum, this would be a memory minimum. GOGC effectively controls what's in the middle. But if you don't have a maximum, only a very high minimum, and set GOGC low, then it kind of acts like a very soft memory limit. To be clear, I'm not proposing another knob right now, just pointing out that this might've helped your situation. I think it would be very unlikely for us to add a new knob any time soon.
Using 3-10x more CPU than the application is definitely possible. The Go runtime sees its available CPU time as roughly GOMAXPROCS*(wall clock time) for some wall-clock duration. Say for example that GOMAXPROCS=8 but the Go program only ever runs a single goroutine. It's unlikely but possible for that single goroutine to allocate heavily enough that the GC uses up to 50% CPU (actually more, because the GC will also soak up idle CPUs during the mark phase, though it yields to the application very readily). To summarize, the 50% cap is based on total assumed available CPU resources, but it also doesn't prevent the GC from soaking up idle CPU resources. If the GC is triggering often, this idle CPU soak-up can end up using a lot of the available CPU. That's kind of extreme, but it's more plausible if you consider an application that is doing a lot of blocking on I/O. There could potentially be a lot of spare idle CPU time in GOMAXPROCS*(wall clock time) that the GC is allowed to take advantage of. Honestly, this policy of soaking up idle CPU time has been troublesome for us over the years, but getting rid of it at this point is also difficult. It's about -1% throughput and +1% tail latency to remove it. We've removed it for very idle applications, but it remains to be seen what else we do. A gctrace would make it clear how much of what is happening.
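(To make the accounting described above concrete, a tiny worked example; the numbers are illustrative and not taken from the traces in this issue.)

```go
package main

import "fmt"

func main() {
	const (
		gomaxprocs = 8.0 // cores the runtime believes it has
		window     = 1.0 // seconds of wall-clock time
		appCPU     = 1.0 // CPU-seconds a single busy goroutine actually uses in that window
	)
	totalBudget := gomaxprocs * window // CPU-seconds the runtime treats as available
	gcCap := 0.5 * totalBudget         // the ~50% cap applies to this total, not to appCPU

	fmt.Printf("GC may use up to %.1f CPU-s vs %.1f CPU-s of application work (%.0fx),\n", gcCap, appCPU, gcCap/appCPU)
	fmt.Println("before counting any idle CPU the GC is additionally allowed to soak up.")
}
```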
Yeah, you're right that it is a starting point, but I think we're going to need more evidence. For example, if we switch it to 25%, why 25%? That's a question I can't answer right now, though I'd like to. (50% is also not rigorously determined, but it has more going for it; it was the default for some prior art, plus we already have it now.) Though, I think this bug report is a good first step. |
I enabled gctraces on the staging service (running with 3 cores). Order of events:
- application starts
- application gets elected as leader, starts allocating. GOGC=100, no limit
- I manually switch to GOGC=-1, GOMEMLIMIT=3GiB
- let it run for 10 minutes
- I switch back to no limit, GOGC=100

Traces:
Separately, @mknyszek - is a runtime switch for gctraces at all feasible? I'd be then able to collect on an actual production service easily. |
More samples with the gctraces running overnight - https://gist.github.com/rabbbit/b6419fd71e42e95fffe35476c1f8cf85 - after 24 hours my override expired and we reset to GOGC=100. Traces cover the whole period below: The gist includes the spike I caused manually (as in, it includes the traces I linked earlier yesterday) |
https://gist.github.com/rabbbit/db4d79875b6b1707b3caad3e19cf2d7e might be more interesting since we actually got to the degraded state for hours. Looks like the memory grew by ~150 MB in the two hours we were degraded, so around 4% of the limit. Thus setting GOMEMLIMIT at 95% of the container would mean a 2-hour degradation before OOM. The 10-minute CPU utilization reduction every 40 minutes looks interesting, but I cannot explain it right now. A mirror system with the same code and inputs does not show the same behavior (at lower CPU utilization), but then I also don't expect GC to have any 40-minute cycles, so it must be something in our system that I don't understand. |
In the earlier issue, @cdvr1993 wrote in #48409 (comment):
And @zephyrtronium also later suggested a similar strategy above. It does seem that it is a reasonable choice for some applications that are using GOMEMLIMIT for the process to monitor its own memory usage and terminate itself if needed, especially when there is a concern about memory leaks or otherwise as a safeguard. (Or alternatively, monitoring memory to then make a decision about releasing some "nice-to-have memory" or otherwise reacting, though a non-terminal reaction might not really be a choice for most apps, as hinted at by some of the older discussion in #29696 and elsewhere). Question: what metric should be used if an application does want to avoid being stuck in a high-CPU usage degraded state for an extended period when close to GOMEMLIMIT but not yet "rescued" by the kernel's OOM killer? Some seemingly imperfect candidates:
Would it make sense to have an additional metric that represents the rest of memory, time-aligned with the value in the WIP metrics work? In other words, maybe the general solution to this issue could be more observability, which might reduce or eliminate the desire for another GC knob. That said, maybe the metrics question is off base or already has a simple answer? |
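(One imperfect candidate along those lines, sketched as an illustration rather than a recommendation from anyone in this thread: sample the cumulative /cpu/classes metrics added in Go 1.20 and compute the GC's share of total CPU over a window. The window length and any alarm threshold are assumptions left to the caller.)

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

// gcCPUFraction returns the fraction of the program's CPU time spent on GC
// over one polling interval, using the cumulative /cpu/classes metrics.
func gcCPUFraction(interval time.Duration) float64 {
	s := []metrics.Sample{
		{Name: "/cpu/classes/gc/total:cpu-seconds"},
		{Name: "/cpu/classes/total:cpu-seconds"},
	}
	metrics.Read(s)
	gc0, total0 := s[0].Value.Float64(), s[1].Value.Float64()

	time.Sleep(interval)

	metrics.Read(s)
	gc1, total1 := s[0].Value.Float64(), s[1].Value.Float64()
	if total1 <= total0 {
		return 0
	}
	return (gc1 - gc0) / (total1 - total0)
}

func main() {
	// Average over a fairly long window to smooth out per-cycle noise.
	fmt.Printf("GC used %.0f%% of total CPU over the last minute\n", 100*gcCPUFraction(time.Minute))
}
```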
Skipping the metric question (for now), leaving another batch of gctraces. On the bright side, while it looks like GC is using 100% of the CPU, the cgroup nr_throttled stats stopped degrading further. The gc-traces do show a continued increase of CPU usage (if I read this correctly), but it's increasing at a slower rate: https://gist.github.com/rabbbit/eebd814e926345298e008289f2e92675. I would have attempted to run the same service with more cores to see if GC usage eventually tops out, but I'll be away for a week. I can run this then. |
Hey @mknyszek, Is there anything you need from us? I posted a few gc-traces in the above comments - do they help diagnose what's happening exactly? |
Hi @rabbbit, I looked over some of the example graphs and traces you posted, but I was worried I might have misunderstood some of the time correlation between the descriptive text and the data, so I thought it might be easier to look at a simplified example. I have a heap benchmark at thepudds/heapbench. I configured it with a slow leak and GOMEMLIMIT using Go 1.20.1 on Linux, with results below. Of course, this won't exactly reproduce what you reported, but maybe a simple example could be useful for discussing with @mknyszek or whoever else. For initial parameters, I looked at your example in #58106 (comment) in a time range around roughly ~2750 seconds (when things seem "happy"), and I imprecisely estimated some key parameters from your trace/graph as:
I configured the benchmark to roughly match those parameters, plus a 1 MiB/sec leak with a 3 GiB GOMEMLIMIT. (That's an intentionally "fast" leak so that results come back sooner, but it would be easy to slow down the leak). To start the benchmark:
The memory-related arguments are in MiB or MiB/sec. (The last two arguments configure the benchmark to use ~90% of a core for pure CPU work: 100 jobs per sec X 9ms CPU per job on average). The summary is you can see it run initially with ~110% CPU utilization (~90% from pure CPU work, the rest mostly from allocation & GC). It then climbs towards 300% CPU utilization as the leak pushes the live heap up towards the GOMEMLIMIT, which I think reproduces your general observation that GC used more CPU than you expected (though seems aligned with what @mknyszek expected as far as I understand):
Here are some sample gctraces from that same run. After allocating the base heap and beginning the actual benchmark, it starts with a ~15 sec GC cycle and a bit more than 2000 MiB live heap and the process is using ~1.1 cores (which were the desired starting parameters from above):
~500 seconds later, it's a ~5-6 sec GC cycle, the live heap is ~2650 MB, and the process is using ~1.5 cores:
At ~1000 seconds, it's a ~2 sec GC cycle, live heap is ~3150 MB, and the process is continually using 3 cores (all of GOMAXPROCS):
In any event, your examples were more complex than that, but I'd be curious (a) whether the behavior shown in the benchmark aligns with your latest understanding of the current implementation, and (b) whether there is one of your more complex examples that you'd want to call out that doesn't seem explainable under that behavior? (E.g., for the one in #58106 (comment), it wasn't immediately obvious to me why it was using all 3 cores if the mem limit was 3GiB, but I also wasn't sure if I was tracking what changed when in that example.) |
For contrast, here's re-running that same benchmark except with GOGC=100 and no GOMEMLIMIT.
The summary is you can see it run initially with ~100% CPU utilization (a little lower than the first benchmark above), and then as the leak progresses, the memory usage climbs but the CPU usage stays steady and the GC cycles take longer and longer, which I think is expected:
Some sample gctraces. After allocating the base heap and beginning the benchmark, it starts with a ~40 sec GC cycle and a bit more than 2000 MiB live heap:
At ~1000 seconds, it's a ~60 sec GC cycle, live heap is ~3150 MB, but in contrast to the benchmark in the comment just above, now the process is using ~6 GiB RSS while still using ~1 core:
Finally, here's running the benchmark with zeros for the memory allocations parameters, just to see the pure CPU work:
|
Hey @thepudds - thank you for your investigation, and sorry for the delay in responding. Yes, I think your tests reproduce the case. The surprising bits were:
The problem is that if:
Our services strongly prefer OOM (owners are used to this) to long performance degradation. Thus we (currently) need a tool to detect excessive GC usage & update either GOGC (currently) or GOMEMLIMIT (in the future) to "relax" the constraints and let the application be killed faster. |
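(A sketch of the "relax the constraints" step described above, using the runtime/debug knobs; the trigger condition is left abstract and the specific values are illustrative assumptions.)

```go
package main

import (
	"math"
	"runtime/debug"
	"time"
)

// relaxWhenDegraded polls an application-defined signal of excessive GC
// activity and, once it trips, removes the memory limit and restores a normal
// GOGC so the process grows toward the container limit and gets OOM-killed
// promptly instead of thrashing in the GC for hours.
func relaxWhenDegraded(degraded func() bool) {
	for range time.Tick(30 * time.Second) {
		if degraded() {
			debug.SetMemoryLimit(math.MaxInt64) // effectively "no limit" again
			debug.SetGCPercent(100)             // back to ordinary GOGC pacing
			return
		}
	}
}

func main() {
	go relaxWhenDegraded(func() bool {
		return false // plug in a GC-CPU or live-heap check here
	})
	select {} // stand-in for the real application
}
```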
Sorry for the long delayed reply here, I got caught up with a whole bunch of other things and kept putting off catching up on this. I've responded to a few of the discussion points below, but please correct me if I got anything wrong! @thepudds Thanks for the smaller reproducer! That's super helpful, especially for experimentation.
All your data is really useful. I do think we understand what's going on here, and from my perspective it's expected, unfortunate as that is. I don't think we plan to do anything here in the near-term.

Given that this state is persistent but not catastrophic, I think a reasonable workaround for your case would be to monitor a handful of runtime metrics and react from within the application. As for what metrics to watch, given that you know the nature of this problem, I think the best metric to watch would be live heap. Unfortunately, there isn't a super good way to grab that today. #56857 would fix that however, and I think we should land something for that this release cycle. I do think watching the GC CPU metrics is also reasonable, but I would pick a fairly large time window to average them over, since they can be noisy and that might lead to unintended takedowns.

I know that this isn't the most straightforward answer and it will unfortunately add complexity to your deployment, but the alternatives are basically research work with an unknown endpoint toward a more adaptive limit, or new knobs (which, as noted above, have a very high bar).

Let's keep this issue open so that more of the community can come in and comment further. If this issue is more widespread, I think it's reasonable to take more actions. So far I haven't been seeing that feedback through the channels available to me, but I'm keeping this in mind going forward.
Note that the third number in the mark-phase CPU breakdown of a gctrace line (the assist/background/idle triple in "#+#/#/#+# ms cpu") is time spent in idle GC workers. The 50% CPU limit does not apply to this "idle time" optimization, because the running application can readily take it back if it needs it.
I made a note to myself to take a closer look at this. |
Actually, sorry, I guess I may have spoken too soon. Overall I stand by what I said in my last comment, but I just took a closer look at the GC trace in #58106 (comment) and noticed something that might be a bug. So, AFAICT, this is the point where the memory limit is set (right?):
But what's odd is you said you set GOMEMLIMIT to 3 GiB, yet after that point the heap goal looks quite a bit lower than that. FTR I don't see the same thing in @thepudds' reproducer; the heap goal does seem to be closer to 3 GiB. |
Oh, one other useful metric to watch (though I still think live heap is best in this particular case) would be You can sample this at the same time as |
It's certainly possible I fat-fingered or misreported something. I don't see a similar drop in the other traces, so perhaps it was a typo. What would this mean - that there's some overhead somewhere that GC cannot clear?
Hm, this is fair, but also:
Oh, no - agreed that this would be too much. The fewer knobs, the better. I liked how you gave yourself some flexibility in the GOMEMLIMIT design ("it might overshoot"), so at most I hoped to convince you to overshoot faster :)
Sounds great, thanks! |
Yeah, basically.
Yeah, that's not good. Impact on CFS quotas from idle GC workers is something I could absolutely see exacerbating this issue. What is the quota set to for your applications above? Does increasing the quota (to say, several full cores) show better behavior in the degraded state? FTR, I'm not a fan of the idle-priority GC workers. See #44163 which originally proposed removing them entirely. I walked that back to just disabling them in some circumstances because they can improve performance and that's hard to take away entirely. |
Hm. Perhaps the idle-priority GC workers should also be limited, like assists. Originally, I treated the idle-priority GC worker time as idle time, but later added actual idle time tracking and just lumped the idle-priority GC worker time into that. But I/O bound applications might end up using substantially more CPU than 50%. This isn't really a problem in a vacuum, but might be a problem with quotas, autoscaling systems, or in a cotenant situation. |
Hi @mknyszek
Maybe I'm off base here, but FWIW, I think it is expected for the CPU usage to grow non-linearly. Using a simplified model of GC costs (including ignoring non-heap memory and so on):
Does that sound plausible? There might be other things going on too, or a bug or similar, but the main point is I don't think linear GC CPU growth is expected as one approaches the mem limit at a constant rate... |
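(A back-of-the-envelope version of that model, assuming GOGC=off, a steady allocation rate, and a GC that runs each time the headroom between the live heap and the limit has been consumed. All numbers are made up for illustration.)

```go
package main

import "fmt"

func main() {
	const (
		limitMiB     = 3072.0 // GOMEMLIMIT
		allocMiBsec  = 200.0  // steady allocation rate of short-lived garbage
		leakMiBsec   = 1.0    // rate at which the live heap grows
		startLiveMiB = 2000.0
	)
	// With GOGC=off, a GC cycle is triggered roughly every time the remaining
	// headroom (limit - live heap) has been filled with new allocations, so the
	// cycle frequency is allocation rate / headroom. Each cycle also has to mark
	// the whole live heap, so GC CPU grows even faster than the frequency alone.
	for t := 0.0; t <= 1000; t += 250 {
		live := startLiveMiB + leakMiBsec*t
		headroom := limitMiB - live
		fmt.Printf("t=%5.0fs  live=%4.0f MiB  headroom=%4.0f MiB  GC cycles/sec ~ %.2f\n",
			t, live, headroom, allocMiBsec/headroom)
	}
}
```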
You are correct that constant heap growth will result in a non-linear increase in GC CPU usage, and your reasoning is sound to me. I was just leaving myself a mental note to look at that GC trace in general since I was still catching up (I hadn't looked closely at every log yet). 😅 What I wanted to double-check was whether that was happening before or after the push past the limit, but you already annotated the log with that info, so yup! It's pretty clear this is WAI. Thanks. |
Hi @mknyszek, I wanted to briefly follow-up on this portion of my comment above from Feb:
With the new runtime metrics that landed this week in #56857 (thanks @felixge!), do you think it is now possible to sum the right metrics and have something that is fairly directly comparable to the GOMEMLIMIT, including the metrics involved in the sum being at least somewhat time-aligned? (It's likely impossible to be perfectly time-aligned, but ideally it'd be better than needing to do X+Y+Z, where X and Y are say from the last GC cycle but Z is close to a current measure.) The intent again is to put enough data into the application's hands via observability so that the application can make decisions and take action, which in turn might simplify what's expected of the runtime (without the application needing to do something like pad an extra 10-15% because it can't get a good measure of some of the material overheads). One other question: I assume the intent is that the runtime metrics are lightweight enough that polling them on some reasonably frequent time tick would be a tiny cost? CC @aclements |
Just to clarify, comparing against GOMEMLIMIT in this case is straightforward: we already export these metrics and document that it's what the Go runtime is trying to maintain. (See https://pkg.go.dev/runtime/debug#SetMemoryLimit; specifically, the limit is maintained against /memory/classes/total:bytes minus /memory/classes/heap/released:bytes.) However, that doesn't help you figure out how much of that memory the GC can actually do anything about.
I believe so. Here's the recipe:
This expression represents the "permanent" parts of Go runtime memory use that the runtime can't do anything about. This is the same computation the runtime uses. If this is some high fraction of the memory limit, you can be sure you'll be executing GCs often. This is a fairly generic expression; I don't expect it to change too much over time. The fraction you pick as your threshold for "too much" will need to be dialed in to taste. You also probably want to be able to tolerate brief increases in this value, otherwise you're basically just back to where you started before FTR, all the metrics above are as real-time as they can be.
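(The exact expression from the recipe above did not survive the copy into this thread. The sketch below reconstructs the stated idea, live heap plus the memory the runtime has mapped outside the heap, from documented runtime/metrics names; treat the specific combination as an approximation rather than a quote of the original recipe.)

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// permanentBytes approximates the memory the GC cannot get rid of: the live
// heap plus everything the runtime has mapped that is not heap span memory.
// /gc/heap/live:bytes is one of the metrics that landed via #56857 (Go 1.21).
func permanentBytes() uint64 {
	s := []metrics.Sample{
		{Name: "/memory/classes/total:bytes"},         // everything mapped by the runtime
		{Name: "/memory/classes/heap/objects:bytes"},  // heap occupied by live and dead objects
		{Name: "/memory/classes/heap/unused:bytes"},   // heap reserved for objects but currently unused
		{Name: "/memory/classes/heap/free:bytes"},     // free heap not yet returned to the OS
		{Name: "/memory/classes/heap/released:bytes"}, // free heap already returned to the OS
		{Name: "/gc/heap/live:bytes"},                 // heap that survived the previous GC
	}
	metrics.Read(s)
	nonHeap := s[0].Value.Uint64() - s[1].Value.Uint64() - s[2].Value.Uint64() -
		s[3].Value.Uint64() - s[4].Value.Uint64()
	return nonHeap + s[5].Value.Uint64()
}

func main() {
	const limit = 3 << 30 // whatever GOMEMLIMIT is set to
	p := permanentBytes()
	fmt.Printf("permanent memory: %d bytes, %.0f%% of the limit\n", p, 100*float64(p)/float64(limit))
}
```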
Totally agree. Beyond exposing a new knob, I'm not sure there's a lot we can do. A lot of this "let's die in this scenario" is going to be tuned to taste for each application. Having a "health checker" goroutine seems like a reasonable way to tackle issues like this.
In this case (
Correct. There's no hefty global synchronization operation involved, and the operation itself is pretty fast (order of tens of microseconds at worst), so you can sample it fairly frequently. |
Thanks @thepudds and @mknyszek! @cdvr1993 said he'll take a look. Rather than implementing a suicide-goroutine, he's planning on:
All credit for the above goes to @cdvr1993, I just basically copy-pasted his thoughts.
Huh, a bit extensive, but we'll try. I imagine convincing you to emit a future-proof metric for this directly would be the nicer long-term option? |
An alternative to increasing GOMEMLIMIT is to switch to a GOGC-based GC trigger once you get close to the limit. This is effectively just a softer limit, but defined in terms of memory rather than CPU usage (which may be less noisy). You can calculate the "effective" GOGC value at any given time like so:
EDIT: I previously wrote this incorrectly and @cdvr1993 corrected me below; updated here in case anyone looks at this comment first. Compare the result against your target minimum GOGC, say for example 10. If that value ever goes below 10, call SetGCPercent(10) and set GOMEMLIMIT back to MaxInt64. Though, I'm curious to hear what you learn from the CPU-based thresholds, too. The above is just somewhat tried-and-true (the old SetMaxHeap experiment worked like this, just internally to the runtime).
It's not the shortest expression, and there are certainly some subtleties. I think you're right that exporting a metric for this might be a good idea. I'll propose something for next release. |
That's an interesting idea. We haven't really tried a max CPU threshold yet. The only threshold we had before was to keep CPU usage below a certain level if we had the memory to do so. So I'll give your recommendation a try; that way I don't need to keep changing the memlimit and can just let it grow by itself. Thanks. |
I think the correct formula is:
or am I missing something? |
Er, yes. That's right. 😅 Updated my comment above. |
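(The corrected formula itself was lost in this copy of the thread, so the sketch below reconstructs an "effective GOGC" from the standard relationship between the heap goal and the live heap; newer runtimes also fold stacks and globals into the goal, so this is an approximation and not a quote of the corrected comment.)

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// effectiveGOGC estimates the GOGC value that would produce the current heap
// goal for the current live heap, using the classic goal = live * (1 + GOGC/100).
func effectiveGOGC() float64 {
	s := []metrics.Sample{
		{Name: "/gc/heap/goal:bytes"}, // heap size the current GC cycle is pacing toward
		{Name: "/gc/heap/live:bytes"}, // heap that survived the previous GC (Go 1.21+)
	}
	metrics.Read(s)
	goal := float64(s[0].Value.Uint64())
	live := float64(s[1].Value.Uint64())
	if live == 0 {
		return 0
	}
	return 100 * (goal - live) / live
}

func main() {
	// Example policy from the discussion above: if the limit has squeezed the
	// effective GOGC below some minimum (say 10), fall back to plain GOGC=10.
	fmt.Printf("effective GOGC is roughly %.1f\n", effectiveGOGC())
}
```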
I've also occasionally observed pathological CPU utilization among a few equally-loaded, identically configured pods forming a kafka consumer group. All of them had almost no real work to do. Most of the pods had 3-5% CPU utilization, and memory use less than GOMEMLIMIT (50-90% of GOMEMLIMIT is typically observed). However, for unclear reasons, a couple pods might get into > GOMEMLIMIT usage (sometimes by nearly double) very quickly after creation, with CPU pegged at 96% or more. A trace revealed that these pods were spending almost all CPU time towards garbage collection. In my case, it does not appear that there's a leak, and I suspect that live heap use isn't actually near or above GOMEMLIMIT, but more investigation would be needed. I suspect there's an anti-sweet-spot that the current GC algorithm can trigger, similar to what was observed with segmented stacks in the past. If pathological behavior is detectable, it would be preferable for these workloads to have a classic stop-the-world remedial GC cycle available in order to force usage back below GOMEMLIMIT, rather than to have cpu thrash persist indefinitely. |
If live heap usage really isn't near or above GOMEMLIMIT, that directly contradicts the high rate of garbage collection, unless memory is being squeezed from some other direction. For example, the sudden creation of a lot of goroutines. (But even that is still unlikely to be the problem, unless the number of goroutines is in the millions or more.) If you have some time, I'd appreciate if you could confirm this. If the rate of garbage collection is increasing without memory being squeezed in any way, that may just be a bug somewhere in the GC. A good diagnostic for confirming this would be a GC trace (GODEBUG=gctrace=1).
I'm not sure what you mean by this. A STW GC isn't going to change the rate of garbage collection, which is the core problem in these extreme cases. The concurrent garbage collection cycle already has other mechanisms to ensure the limit isn't exceeded, like GC assists. (You can kind of think of the assist rate as a slider between a fully concurrent collector that doesn't impact mutator execution and a fully STW collector that forces every goroutine to help the garbage collector until all work is done, much like a STW GC.) Also I'm not sure I follow what you mean by this idea of an extra "remedial" garbage collection. If the live heap is high, or memory is being squeezed from some other direction, then an extra garbage collection isn't going to help alleviate the problem. (In fact, that's kind of already what the GC is trying to do when the rate of garbage collection is increasing.) FWIW, this "pathological behavior" is already detected and mitigated to some degree. The absolute worst-case GC death spiral is cut off by a GC CPU limiter, which allows the program to exceed the memory limit rather than let GC CPU usage spiral further. |
@extemporalgenome Just circling back to this to see if you have any more information. If you suspect a bug, please file a new issue! Thanks. |
What version of Go are you using (go version)?
go1.19.5, linux/amd64
Does this issue reproduce with the latest release?
yes
What did you do?
I ran a slowly leaking memory application with GOMEMLIMIT and GOGC=-1.
What did you expect to see?
I expected the application to run fine for a while. After a few days, I expected the CPU utilization to increase gradually (slightly, by <25-50%). I then expected the application to be OOM-killed by the kernel.
What did you see instead?
After the initial expected GC CPU utilization increase, the utilized CPU increased dramatically (75% of the CPU time available to the container). The application remained in a degraded state for a long time (5-8+ hours).
Report
This is a continuation of a conversation started in #48409 (comment) - I thought I'd start a new issue.
tldr;
Investigation
Hey @mknyszek, I come bringing more data. In summary:
I was testing two of our applications by dynamically adjusting GOMEMLIMIT.
1. A Latency-sensitive proxy in production.
Observations:
The container runs with GOMAXPROCS=4 (a 4-core cgroup limit) and 10 GB of memory. At the time of the tests, the cgroup-reported RSS was 9.2 GiB.
The test was:
-- 9563676416 - normal
-- 9463676416 - increase to 150%
-- 9363676416 - increase to 300%
Container CPU utilization before the test was fairly stable for ~2 days, so I don't think any external factors affect the test. The previous spikes are me modifying the MEMLIMIT manually.
Benchmarks
I ran three benchmarks during the "normal" and "high CPU" periods: (1) unlimited throughput, (2) 1000 RPS, (3) 100 RPS. The throughput is (surprisingly) only ~10% down; latency changes are visible, though. This is the same host, with GOMEMLIMIT set at all times, without restarts.
Benchmarks
All benchmarks are "degraded" followed by "non-degraded".
Unlimited throughput:
1000 RPS:
100RPS:
2. A leader-elected, not-so-important application.
The leader allocates a lot. The backup containers gather some data (memory "leaks" on all), but don't allocate as much. Containers run with 3 CPUs, and 16 GiB memory.
The test: I restarted and set GOMEMLIMIT=4GiB on all containers. I let them run.
We see:
This case is interesting because it's similar to a "failover" scenario where the application might suddenly get 50% extra traffic.
We measure the cgroup's nr_throttled periods and compute a "healthiness" score from them.
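(For reference, a sketch of reading those throttling counters on a cgroup v2 host. The mount path and the "healthiness" formula are assumptions; cgroup v1 exposes the same counters in a different location.)

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpuStat reads nr_periods and nr_throttled from a cgroup v2 cpu.stat file.
func cpuStat(path string) (periods, throttled uint64, err error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		v, _ := strconv.ParseUint(fields[1], 10, 64)
		switch fields[0] {
		case "nr_periods":
			periods = v
		case "nr_throttled":
			throttled = v
		}
	}
	return periods, throttled, sc.Err()
}

func main() {
	p, t, err := cpuStat("/sys/fs/cgroup/cpu.stat")
	if err != nil || p == 0 {
		fmt.Println("could not read throttling stats:", err)
		return
	}
	// One possible healthiness signal: the fraction of CFS periods that were throttled.
	fmt.Printf("throttled %d of %d periods (%.1f%%)\n", t, p, 100*float64(t)/float64(p))
}
```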
I'll try to get gctraces over the next few days.