runtime/debug: soft memory limit #48409

Closed
mknyszek opened this issue Sep 15, 2021 · 42 comments

Comments

@mknyszek
Contributor

mknyszek commented Sep 15, 2021

Proposal: Soft memory limit

Author: Michael Knyszek

Summary

I propose a new option for tuning the behavior of the Go garbage collector by setting a soft memory limit on the total amount of memory that Go uses.

This option comes in two flavors: a new runtime/debug function called SetMemoryLimit and a GOMEMLIMIT environment variable. In sum, the runtime will try to maintain this memory limit by limiting the size of the heap, and by returning memory to the underlying platform more aggressively. This includes a mechanism to help mitigate garbage collection death spirals. Finally, by setting GOGC=off, the Go runtime will always grow the heap to the full memory limit.

This new option gives applications better control over their resource economy. It empowers users to:

  • Better utilize the memory that they already have,
  • Confidently decrease their memory limits, knowing Go will respect them,
  • Avoid unsupported forms of garbage collection tuning.

Details

Full design document found here.

Note that, for the time being, this proposal intends to supersede #44309. Frankly, I haven't been able to find a significant use-case for it, as opposed to a soft memory limit overall. If you believe you have a real-world use-case for a memory target where a memory limit with GOGC=off would not solve the same problem, please do not hesitate to post on that issue, contact me on the gophers slack, or via email at [email protected]. Please include as much detail as you can.

@mknyszek mknyszek added this to the Go1.18 milestone Sep 15, 2021
@gopherbot
Contributor

Change https://golang.org/cl/350116 mentions this issue: design: add proposal for a soft memory limit

gopherbot pushed a commit to golang/proposal that referenced this issue Sep 21, 2021
For golang/go#48409.

Change-Id: I4e5d6d117982f51108dca83a8e59b118c2b6f4bf
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/350116
Reviewed-by: Michael Pratt <[email protected]>
@mpx
Contributor

mpx commented Sep 21, 2021

Afaict, the impact of memory limit is visible once the GC is CPU throttled, but not before. Would it be worth exposing the current effective GOGC as well?

@mknyszek
Contributor Author

@mpx I think that's an interesting idea. If GOGC is not off, then you have a very clear sign of throttling in telemetry. However, if GOGC=off I think it's harder to tell, and it gets blurry once the runtime starts bumping up against the GC CPU utilization limit, i.e. what does effective GOGC mean when the runtime is letting itself exceed the heap goal?

I think that's pretty close. Ideally we would have just one metric that could show, at-a-glance, "are you in the red, and if so, how far?"
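As a rough illustration of what an "effective GOGC" reading could look like, the sketch below derives it from the heap goal and the live heap. It assumes the relationship from the GC guide (heap goal = live heap + (live heap + GC roots) × GOGC / 100) and ignores GC roots, so it is only an approximation. The function names are hypothetical, and `/gc/heap/live:bytes` is only available in newer Go releases (1.21 and later, to my knowledge), so the code checks for missing metrics.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// effectiveGOGC estimates the GOGC value implied by a heap goal and a live
// heap size. It ignores GC roots (stacks and globals), so it is approximate.
func effectiveGOGC(goalBytes, liveBytes uint64) float64 {
	if liveBytes == 0 || goalBytes <= liveBytes {
		return 0
	}
	return float64(goalBytes-liveBytes) / float64(liveBytes) * 100
}

// readUint64 returns the named metric, or false if this Go version lacks it.
func readUint64(name string) (uint64, bool) {
	s := []metrics.Sample{{Name: name}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindUint64 {
		return 0, false
	}
	return s[0].Value.Uint64(), true
}

func main() {
	goal, okGoal := readUint64("/gc/heap/goal:bytes")
	live, okLive := readUint64("/gc/heap/live:bytes")
	if !okGoal || !okLive {
		fmt.Println("required metrics are not available in this Go version")
		return
	}
	fmt.Printf("goal=%d live=%d effective GOGC ~ %.0f\n", goal, live, effectiveGOGC(goal, live))
}
```

When the memory limit is pulling the heap goal down, this estimate drops below the configured GOGC, which is one possible "in the red" signal. It says nothing about the case where the runtime exceeds the heap goal.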

@mknyszek mknyszek modified the milestones: Go1.18, Proposal Sep 22, 2021
@raulk

raulk commented Sep 27, 2021

In case you find this useful as a reference (and possibly to include in "prior art"), the go-watchdog library schedules GC according to a user-defined policy. It can infer limits from the environment/host, container, and it can target a maximum heap size defined by the user. I built this library to deal with #42805, and ever since we integrated it into https://github.com/filecoin-project/lotus, we haven't had a single OOM reported.

@rsc
Contributor

rsc commented Oct 6, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Oct 13, 2021

@mknyszek what is the status of this?

@mknyszek
Contributor Author

@rsc I believe the design is complete. I've received feedback on the design, iterated on it, and I've arrived at a point where there aren't any major remaining comments that need to be addressed. I think the big question at the center of this proposal is whether the API benefit is worth the cost. The implementation can change and improve over time; most of the details are internal.

Personally, I think the answer is yes. I've found that mechanisms that respect users' memory limits and that give the GC the flexibility to use more of the available memory are quite popular. Where Go users implement this themselves, they're left working with tools (like runtime.GC/debug.FreeOSMemory and heap ballasts) that have some significant pitfalls. The proposal also takes steps to mitigate the most significant costs of having a new GC tuning knob.

In terms of implementation, I have some of the foundational bits up for review now that I wish to land in 1.18 (I think they're uncontroversial improvements, mostly related to the scavenger). My next step is to create a complete implementation and trial it on real workloads. I suspect that a complete implementation won't land in 1.18 at this point, which is fine. It'll give me time to work out any unexpected issues with the design in practice.

@rsc
Contributor

rsc commented Oct 20, 2021

Thanks for the summary. Overall the reaction here seems overwhelmingly positive.

Does anyone object to doing this?

@kent-h

kent-h commented Oct 26, 2021

I have some of the foundational bits up for review now that I wish to land in 1.18

I suspect that a complete implementation won't land in 1.18

@mknyszek I'm somewhat confused by this. At a high level, what are you hoping to include in 1.18, and what do you expect to come later?
(Specifically: will we have extra knobs in 1.18, or will these changes be entirely internal?)

@mknyszek
Contributor Author

@kent-h The proposal has not been accepted, so the API will definitely not land in 1.18. All that I'm planning to land is work on the scavenger, to make it scale a bit better. This is useful in its own right, and it happens that the implementation of SetMemoryLimit as described in the proposal depends on it. There won't be any internal functionality pertaining to SetMemoryLimit in the tree in Go 1.18.

@rsc
Contributor

rsc commented Oct 27, 2021

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Nov 3, 2021

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title proposal: runtime/debug: soft memory limit runtime/debug: soft memory limit Nov 3, 2021
@rsc rsc modified the milestones: Proposal, Backlog Nov 3, 2021
@gopherbot
Contributor

Change https://go.dev/cl/406574 mentions this issue: runtime: reduce useless computation when memoryLimit is off

@gopherbot
Contributor

Change https://go.dev/cl/406575 mentions this issue: runtime: update description of GODEBUG=scavtrace=1

gopherbot pushed a commit that referenced this issue May 20, 2022
For #48409.

Change-Id: I056afcdbc417ce633e48184e69336213750aae28
Reviewed-on: https://go-review.googlesource.com/c/go/+/406575
Reviewed-by: Michael Knyszek <[email protected]>
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
tjvc added a commit to tjvc/gauche that referenced this issue Jun 5, 2022
WIP implementation of a memory limit. This will likely be superseded
by Go's incoming soft memory limit feature (coming August?), but it's
interesting to explore nonetheless.

Each time we receive a PUT request, check the used memory. To calculate
used memory, we use runtime.ReadMemStats. I was concerned that it would
have a large performance cost, because it stops the world on every
invocation, but it turns out that it has previously been optimised.
Return a 500 if this value has exceeded the current max memory. We
use TotalAlloc to determine used memory, because this seemed to be
closest to the container memory usage reported by Docker. This is broken
regardless, because the value does not decrease as we delete keys
(possibly because the store map does not shrink).

If we can work out a constant overhead for the map data structure, we
might be able to compute memory usage based on the size of keys and
values. I think it will be difficult to do this reliably, though. Given
that a new language feature will likely remove the need for this work,
a simple interim solution might be to implement a max number of objects
limit, which provides some value in situations where the user can
predict the size of keys and values.

TODO:

* Make the memory limit configurable by way of an environment variable
* Push the limit checking code down to the put handler

golang/go#48409
golang/go@4a7cf96
patrickmn/go-cache#5
https://github.com/vitessio/vitess/blob/main/go/cache/lru_cache.go
golang/go#20135
https://redis.io/docs/getting-started/faq/#what-happens-if-redis-runs-out-of-memory
https://redis.io/docs/manual/eviction/
@gopherbot
Contributor

Change https://go.dev/cl/410735 mentions this issue: doc/go1.19: adjust runtime release notes

@gopherbot
Contributor

Change https://go.dev/cl/410734 mentions this issue: runtime: document GOMEMLIMIT in environment variables section

gopherbot pushed a commit that referenced this issue Jun 7, 2022
For #48409.

Change-Id: Ia6616a377bc4c871b7ffba6f5a59792a09b64809
Reviewed-on: https://go-review.googlesource.com/c/go/+/410734
Run-TryBot: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Chris Hines <[email protected]>
Reviewed-by: Russ Cox <[email protected]>
gopherbot pushed a commit that referenced this issue Jun 7, 2022
This addresses comments from CL 410356.

For #48409.
For #51400.

Change-Id: I03560e820a06c0745700ac997b02d13bc03adfc6
Reviewed-on: https://go-review.googlesource.com/c/go/+/410735
Run-TryBot: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Chris Hines <[email protected]>
Reviewed-by: Russ Cox <[email protected]>
@rsc rsc moved this to Accepted in Proposals Aug 10, 2022
@rsc rsc added this to Proposals Aug 10, 2022
@rabbbit

rabbbit commented Jan 22, 2023

Hey @mknyszek - first of all, thanks for the excellent work; this is great.

I wanted to share our experience thinking about enabling this in production. It works great and exactly as advertised. Some well-maintained applications have enabled it with great success, and the usage is spreading organically.

We'd ideally want to enable it for everyone by default (a vast majority of our applications have plenty of memory available), but we're currently too afraid to do this. The reason is the death spirals you called out in the proposal. Applications leaking memory, with GOMEMLIMIT, can get to a significantly degraded state. Paradoxically, those applications would rather OOM, die quickly, and be restarted than struggle for a long time. The number of applications makes avoiding leaks unfeasible.

A part of the problem (perhaps) is that we lack a good enough way of setting the right limit. We cannot set it to 98-99% of the container memory because some other applications can be running there. But, if we set it to 90%, once we hit the death spiral situation, we're in a degraded state for too long - it can take hours for OOM, and in the meantime, we are at risk of all containers of an application entering the degraded state.

Another aspect is that our containers typically don't use close to all the available CPU time. So the assumption from the gc-guide, while true, has a slightly different result in practice:

The intuition behind the 50% GC CPU limit is based on the worst-case impact on a program with ample available memory. In the case of a misconfiguration of the memory limit, where it is set too low mistakenly, the program will slow down at most by 2x, because the GC can't take more than 50% of its CPU time away.

The GC might use at most 50% of the total CPU time, but it can end up using 2-3x more CPU than the actual application work. This "GC degradation" would be hard to explain/sell to application owners.

We're also concerned with a "degradation on failover" situation - an application that might be okay usually, in case of a sudden increase in traffic, might end up in a death spiral. And this would be precisely the time we need to avoid those.

What we're doing now is:

  • most high-core applications have an internal GC tuner by @cdvr1993 described here. This work predates your work but is stable.
  • some applications are opting in for enabling GOMEMLIMIT independently.
  • we'd like to enable the GC tuning on more applications - ideally with GOMEMLIMIT to reduce the amount of custom code. Since we're afraid of the death spirals, though, we've discussed building a "lightweight" version of our tuner that would watch runtime stats (perhaps runtime/metrics: add /gc/heap/live:bytes #56857) that would dynamically limit the GC usage more aggressively and let applications die faster.
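
To make the "lightweight tuner" idea in the last bullet concrete, here is a hedged sketch of the decision such a watchdog might make. It is not the tuner referenced above. The `overBudget` helper and the 90% threshold are hypothetical, and `/gc/heap/live:bytes` is the metric requested in #56857, which as far as I know is only present in Go 1.21 and later.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"runtime/metrics"
)

// overBudget reports whether the live heap exceeds pct percent of the limit.
func overBudget(liveBytes uint64, limitBytes int64, pct uint64) bool {
	if limitBytes <= 0 {
		return false
	}
	return liveBytes > uint64(limitBytes)/100*pct
}

func main() {
	limit := debug.SetMemoryLimit(-1) // a negative input only queries the limit

	s := []metrics.Sample{{Name: "/gc/heap/live:bytes"}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindUint64 {
		fmt.Println("/gc/heap/live:bytes is not available in this Go version")
		return
	}
	live := s[0].Value.Uint64()

	if overBudget(live, limit, 90) {
		// A real watchdog might exit here so the container restarts quickly.
		fmt.Println("live heap is near the limit")
		return
	}
	fmt.Printf("live heap %d bytes is within budget (limit %d)\n", live, limit)
}
```

A production version would presumably run this check on a timer and require several consecutive violations before acting, since a single sample can be noisy.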

Hope this is useful. Again, thanks for the excellent work.

@mknyszek
Contributor Author

Thanks for the detailed feedback and I'm glad it's working well for you overall!

Speaking broadly, I'd love to know more about what exactly this degraded state looks like. What is the downstream effect? Latency increase? Throughput decrease? Both? If you could obtain a GODEBUG=gctrace=1 (outputs to STDERR) of this degraded state, that would be helpful in identifying what if any next steps we should take.

We'd ideally want to enable it for everyone by default (a vast majority of our applications have plenty of memory available), but we're currently too afraid to do this. The reason is the death spirals you called out in the proposal. Applications leaking memory, with GOMEMLIMIT, can get to a significantly degraded state. Paradoxically, those applications would rather OOM, die quickly, and be restarted than struggle for a long time. The number of applications makes avoiding leaks unfeasible.

Choosing to die quickly over struggling for a long time is an intentional point in the design. In these difficult situations something has to give and we chose to make that memory.

But also if the scenario here is memory leaks, it's hard to do much about that without fixing the leak. The live heap will grow and eventually even without GOMEMLIMIT you'll OOM as well. GOMEMLIMIT isn't really designed to deal with a memory leak well (generally, we consider memory leaks to be a bug in long-running applications), and yeah I can see turning it on basically turning into "well, it just gets slower before it dies, and it takes longer to die," which may be worse than not setting a memory limit at all.

As for fixing memory leaks, we're currently planning some work on improving the heap analysis situation. I hope that'll make keeping applications leak-free more feasible in the future. (#57447)

(I recognize that encountering a memory leak bug at some point is inevitable, but in general we don't expect long-running applications to run under the expectation of memory leaks. I also get that it's a huge pain these days to debug them, but we're looking into trying to make that better with heap analysis.)

A part of the problem (perhaps) is that we lack a good enough way of setting the right limit. We cannot set it to 98-99% of the container memory because some other applications can be running there. But, if we set it to 90%, once we hit the death spiral situation, we're in a degraded state for too long - it can take hours for OOM, and in the meantime, we are at risk of all containers of an application entering the degraded state.

FTR that's what the runtime/debug.SetMemoryLimit API is for and it should be safe (performance-wise) to call with a relatively high frequency. Just to be clear, is this also the memory leak scenario?

The 90% case you're describing sounds like a misconfiguration to me; if the application's live heap is really close enough to the memory limit to achieve this kind of death spiral scenario, then the intended behavior is to die after a relatively short period, but it might not if it turns out there's actually plenty of available memory. However, this cotenant situation might not be ideal for the memory limit to begin with.

As a general rule, the memory limit, when used in conjunction with GOGC=off, is not a great fit for an environment where the Go program is potentially cotenant with others, and the others don't have predictable memory usage (or the Go application can't easily respond to cotenant changes). See https://go.dev/doc/gc-guide#Suggested_uses. In this case I'd suggest slightly overcommitting the memory limit to protect against many transient spikes in memory use (in your example here, maybe 95-96%), but set GOGC to something other than off.
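
A minimal sketch of that suggestion, under the assumption that the container's memory size is already known to the program (how it is discovered is out of scope here). The `limitFor` helper, the 4 GiB container size, and the 95% figure are illustrative only. GOGC is deliberately left at its default rather than turned off.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// limitFor returns pct percent of containerBytes using integer arithmetic.
// Dividing first avoids overflow for very large inputs.
func limitFor(containerBytes, pct int64) int64 {
	return containerBytes / 100 * pct
}

func main() {
	const containerBytes = 4 << 30 // hypothetical 4 GiB container
	limit := limitFor(containerBytes, 95)

	// Only the memory limit is changed. GOGC keeps its current value, so the
	// GC still runs proportionally and the limit acts as a backstop.
	debug.SetMemoryLimit(limit)
	fmt.Printf("soft memory limit set to %d bytes\n", limit)
}
```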

The GC might use at most 50% of the total CPU time, but it can end up using 2-3x more CPU than the actual application work. This "GC degradation" would be hard to explain/sell to application owners.

I'm not sure I follow. Are you describing a situation in which your application is using say, 25% CPU utilization, and the GC is eating up 50%?

We're also concerned with a "degradation on failover" situation - an application that might be okay usually, in case of a sudden increase in traffic, might end up in a death spiral. And this would be precisely the time we need to avoid those.

(Small pedantic note, but the 50% GC CPU limiter is a mechanism to cut off the death spiral; in general a death spiral means that the GC keeps taking on more and more of the CPU load until application progress stops entirely.)

I think it depends on the load you're expecting. It's always possible to construct a load that'll cause some form of degradation, even when you're not using the memory limit (something like a tight OOM loop as the service gets restarted would be what I would expect with just GOGC).

If the memory limit is failing to degrade gracefully, then that's certainly a problem and a bug on our side (perhaps even a design flaw somewhere!). (Perhaps this risk of setting a limit too low such that you sit in the degraded state for too long instead of actually falling over can be considered something like failing to degrade gracefully, and that suggests that even 50% GC CPU is trying too hard as a default. I can believe that but I'd like to acquire more data first.)

However, without more details about the scenario in question, I'm not sure what else we can do to alleviate the concern. One idea is a backpressure mechanism (#29696), but for now I think we've decided to see what others can build since the wisdom of this space seems to have shifted a few times over the last few years (e.g. what metric should we use? Memory? CPU? Scheduling latency? A combination? If so, what combination and weighted how? Perhaps it's very application-dependent?).

What we're doing now is:

As a final note, I just want to point out that at the end of the day, the memory limit is just another tool in the toolkit. If you can make some of your applications work better without it, I don't think that necessarily means it's a failure of the memory limit (sometimes it might be, but not always). I'm not saying that you necessarily think the memory limit should be used everywhere, just wanted to leave that here for anyone who comes looking at this thread. :)

@cdvr1993

Hi @mknyszek

Regarding the 50% CPU limit: unless we understand incorrectly, it means the GC can use up to that much CPU to avoid going over the soft limit, but for many of our applications anything more than 20% GC CPU can have a serious impact (mostly when in a failover state). Currently, we change GOGC dynamically: when there is memory available we tend to increase it, and when there isn't we keep decreasing it to enforce our own soft limit. However, we have a minimum threshold, and we allow different service owners to set their own minimum threshold. That's more or less what we are missing with the Go soft limit.

We currently don't have an example using soft limit, but in the past we have had issues with GOGC being too low and this caused bigger problems than a few instances crashing due to OOM. So, based on that assumption we think the scenario would repeat with soft limit.

What would be nice is a way of modifying how much CPU the GC can take to enforce the soft limit, or a minimum GOGC value, so that service owners can decide at what point they believe it is better to OOM than to suffer the degradation caused by the elevated GC.

Or would you suggest it is better to wait for #56857 to have a way to keep an eye on the size of live bytes, so that when it gets close to the soft limit we can decide whether to eat the cost of GC or just OOM?
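
The behavior described in this comment (raise GOGC when memory is available, lower it under pressure, never go below a floor) can be sketched roughly as follows. This is a simplified illustration and not the tuner the comment refers to. The `nextGOGC` helper, the clamp values, and the input sizes are all hypothetical.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// nextGOGC picks the GOGC value that would place the heap goal roughly at
// limitBytes for the given live heap, clamped to [minGOGC, maxGOGC].
// The floor is what lets a service prefer an OOM over excessive GC work.
func nextGOGC(liveBytes, limitBytes uint64, minGOGC, maxGOGC int) int {
	if liveBytes == 0 {
		return maxGOGC
	}
	if limitBytes <= liveBytes {
		return minGOGC
	}
	g := float64(limitBytes-liveBytes) / float64(liveBytes) * 100
	if g < float64(minGOGC) {
		return minGOGC
	}
	if g > float64(maxGOGC) {
		return maxGOGC
	}
	return int(g)
}

func main() {
	// Hypothetical inputs: 600 MiB live heap under a 1 GiB self-imposed limit.
	const live, limit = 600 << 20, 1 << 30
	g := nextGOGC(live, limit, 25, 500)
	prev := debug.SetGCPercent(g)
	fmt.Printf("GOGC %d -> %d\n", prev, g)
}
```

Unlike GOMEMLIMIT, a tuner of this shape stops tightening the GC once the floor is reached, so a leaking process keeps growing until the OOM killer ends it.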

@rabbbit

rabbbit commented Jan 24, 2023

Thanks for the detailed feedback and I'm glad it's working well for you overall!

Speaking broadly, I'd love to know more about what exactly this degraded state looks like. What is the downstream effect? Latency increase? Throughput decrease? Both? If you could obtain a GODEBUG=gctrace=1 (outputs to STDERR) of this degraded state, that would be helpful in identifying what if any next steps we should take.

Getting the traces to work in production would be hard. We have an HTTP handler to tune GOMEMLIMIT per container, so we can experiment with that with reasonable safety. There's no way to enable traces at runtime, right?

That being said I can perhaps try to reproduce the same situation in staging. What we have seen in production was a significant CPU time utilization increase, leading to CPU throttling, leading to both latency increase and throughput decrease.

Below is a screenshot of a "slowly leaking application" (more explained below) where we enabled GOMEMLIMIT temporarily. Note the CPU utilization increased significantly more than we expected - more than 50% of GOMAXPROCS.

[image: CPU utilization of the slowly leaking application with GOMEMLIMIT temporarily enabled]

We'd ideally want to enable it for everyone by default (a vast majority of our applications have plenty of memory available), but we're currently too afraid to do this. The reason is the death spirals you called out in the proposal. Applications leaking memory, with GOMEMLIMIT, can get to a significantly degraded state. Paradoxically, those applications would rather OOM, die quickly, and be restarted than struggle for a long time. The number of applications makes avoiding leaks unfeasible.

Choosing to die quickly over struggling for a long time is an intentional point in the design. In these difficult situations something has to give and we chose to make that memory.

But also if the scenario here is memory leaks, it's hard to do much about that without fixing the leak. The live heap will grow and eventually even without GOMEMLIMIT you'll OOM as well. GOMEMLIMIT isn't really designed to deal with a memory leak well (generally, we consider memory leaks to be a bug in long-running applications), and yeah I can see turning it on basically turning into "well, it just gets slower before it dies, and it takes longer to die," which may be worse than not setting a memory limit at all.

As for fixing memory leaks, we're currently planning some work on improving the heap analysis situation. I hope that'll make keeping applications leak-free more feasible in the future. (#57447)
(I recognize that encountering a memory leak bug at some point is inevitable, but in general we don't expect long-running applications to run under the expectation of memory leaks. I also get that it's a huge pain these days to debug them, but we're looking into trying to make that better with heap analysis.)

So I think you might be too optimistic vs what we see in our reality here (sorry:)). We:

  1. have applications that are leaking quickly; they restart often and need to be fixed. Those typically have higher priority, and can be diagnosed with some effort - I wouldn't actually call it pain though, profiles are typically helpful enough.
  2. "slowly leaking memory" applications that just very slowly accumulate memory as they run. These are actually low-priority - as long as the {release_frequency}>2-5*{time_to_oom}, fixing it will not get prioritized. Especially if some of the leaks are in gnarly bits like stat emission. This only becomes a problem during extended quiet periods - the expectation is still that the applications will crash rather than degrade.

In summary though, we strongly expect leaks to be around forever.

A part of the problem (perhaps) is that we lack a good enough way of setting the right limit. We cannot set it to 98-99% of the container memory because some other applications can be running there. But, if we set it to 90%, once we hit the death spiral situation, we're in a degraded state for too long - it can take hours for OOM, and in the meantime, we are at risk of all containers of an application entering the degraded state.

FTR that's what the runtime/debug.SetMemoryLimit API is for and it should be safe (performance-wise) to call with a relatively high frequency. Just to be clear, is this also the memory leak scenario?

Yeah, so we would need to continue running a custom tuner though, right? It also seems if we're tuning in "user-space", equivalent results can be achieved with GOGC and GOMEMLIMIT - right?

The 90% case you're describing sounds like a misconfiguration to me; if the application's live heap is really close enough to the memory limit to achieve this kind of death spiral scenario, then the intended behavior is to die after a relatively short period, but it might not if it turns out there's actually plenty of available memory. However, this cotenant situation might not be ideal for the memory limit to begin with.

As a general rule, the memory limit, when used in conjunction with GOGC=off, is not a great fit for an environment where the Go program is potentially cotenant with others, and the others don't have predictable memory usage (or the Go application can't easily respond to cotenant changes). See https://go.dev/doc/gc-guide#Suggested_uses. In this case I'd suggest slightly overcommitting the memory limit to protect against many transient spikes in memory use (in your example here, maybe 95-96%), but set GOGC to something other than off.

This is slightly more nuanced (and perhaps off-topic): each of our containers runs with a "helper" process responsible for starting up, shipping logs, and performing local health checks (it's silly, don't ask). The memory we need to reserve for it varies per application - thus, for small containers, 95% might not be enough. For larger applications, we can increase the limit, but for both cases, we'd likely still need to look at the log output dynamically.

It is not immediately clear to me how to tune the right value of GOGC combined with GOMEMLIMIT. But, more importantly, my understanding of GOMEMLIMIT is that no matter the GOGC value we can still hit the death-spiral situation.

The GC might use at most 50% of the total CPU time, but it can end up using 2-3x more CPU than the actual application work. This "GC degradation" would be hard to explain/sell to application owners.

I'm not sure I follow. Are you describing a situation in which your application is using say, 25% CPU utilization, and the GC is eating up 50%?

Yeah, @cdvr1993 explained it in the previous comment too. Say a container has GOMAXPROCS=8 but is utilizing 3 cores at the time. Then we hit GOMEMLIMIT, and the GC is allowed (per our understanding) to use up to 4 cores, so the GC is now using more CPU than the application. At the same time, anything above 80% CPU utilization (in our experience) results in dramatically increased latency.
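
The arithmetic in this example can be written out as a small sketch. It takes the 50% limiter and the numbers from the paragraph above at face value, as a worst case. The function names are hypothetical.

```go
package main

import "fmt"

// gcCPUCap returns the number of cores the GC may use under a 50% limiter.
func gcCPUCap(gomaxprocs float64) float64 {
	return gomaxprocs * 0.5
}

// utilization returns the fraction of available cores in use.
func utilization(appCores, gcCores, gomaxprocs float64) float64 {
	return (appCores + gcCores) / gomaxprocs
}

func main() {
	const procs, app = 8.0, 3.0 // GOMAXPROCS=8, application using 3 cores
	gc := gcCPUCap(procs)
	fmt.Printf("GC cap: %.0f cores, GC/app ratio: %.2f, total utilization: %.1f%%\n",
		gc, gc/app, utilization(app, gc, procs)*100)
}
```

With these inputs the GC cap is 4 cores and total utilization reaches 87.5%, which is above the 80% threshold mentioned above even though the GC never exceeds half of GOMAXPROCS.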

We're also concerned with a "degradation on failover" situation - an application that might be okay usually, in case of a sudden increase in traffic, might end up in a death spiral. And this would be precisely the time we need to avoid those.

(Small pedantic note, but the 50% GC CPU limiter is a mechanism to cut off the death spiral; in general a death spiral means that the GC keeps taking on more and more of the CPU load until application progress stops entirely.)

Perhaps we need a different name here then:) What we've observed might not be a death spiral, but a degradation large enough to severely disrupt production. Even with the 50% limit.

I think it depends on the load you're expecting. It's always possible to construct a load that'll cause some form of degradation, even when you're not using the memory limit (something like a tight OOM loop as the service gets restarted would be what I would expect with just GOGC).

Yeah, the problem seems to occur for applications that are "mostly fine", with days between OOMs.

If the memory limit is failing to degrade gracefully, then that's certainly a problem and a bug on our side (perhaps even a design flaw somewhere!). (Perhaps this risk of setting a limit too low such that you sit in the degraded state for too long instead of actually falling over can be considered something like failing to degrade gracefully, and that suggests that even 50% GC CPU is trying too hard as a default. I can believe that but I'd like to acquire more data first.)

However, without more details about the scenario in question, I'm not sure what else we can do to alleviate the concern. One idea is a backpressure mechanism (#29696), but for now I think we've decided to see what others can build since the wisdom of this space seems to have shifted a few times over the last few years (e.g. what metric should we use? Memory? CPU? Scheduling latency? A combination? If so, what combination and weighted how? Perhaps it's very application-dependent?).

IMO it seems like what you built is "almost perfect". We just need the applications to "die faster" - the easiest changes that come to mind would be reducing the limit from 50%, to either something like 25% or a static value (2 cores?).

When I say "almost perfect" I mean it though - I suspect we could roll out GOMEMLIMIT to 98% of our applications with great results and without a problem, but the remaining users would come after us with pitchforks. And that forces us to use GOMEMLIMIT as an opt-in, which is very disappointing given the results we see in 98% of the applications.

Thanks for the thoughtful response!

@rabbbit

rabbbit commented Jan 27, 2023

Hey @mknyszek @cdvr1993 I raised a new issue in #58106.

@rsc rsc removed this from Proposals May 3, 2023
@VEDANTDOKANIA

@rabbbit @mknyszek @rsc we are facing an issue regarding the memory limit. We are setting the memory limit to 18GB on a 24GB server, but the GC still runs very frequently and eats up 80 percent of CPU, while memory used is only 4 to 5 GB at most. Also, is the memory limit per goroutine? If so, how do we set it for the whole program?

In the entry point of our application we have specified something like this:

debug.SetMemoryLimit(int64(8 * 1024 * 1024 * 1024))

Is this okay, or do we need to do something additional? Also, where do we set the optional unit described in the documentation?

@mknyszek
Contributor Author

@VEDANTDOKANIA Unfortunately I can't help with just the information you gave me.

Firstly, how are you determining that the GC runs very frequently, and that it uses 80 percent of CPU? That's far outside of the bounds of what the GC should allow: there's an internal limiter to 50% of available CPU (as defined by GOMAXPROCS) that will prioritize using new memory over additional CPU usage beyond that point.

Please file a new issue with more details, ideally:

  • Platform
  • Go version
  • The environment you're running in (if in a container or cgroup, what is the container's CPU quota?)
  • The STDERR output of running your program with GODEBUG=gctrace=1.

Thanks.
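
One way to cross-check a figure like "80 percent of CPU" from inside the process is the CPU-class estimates in runtime/metrics. The sketch below assumes a Go version that exposes `/cpu/classes/gc/total:cpu-seconds` and `/cpu/classes/total:cpu-seconds` (Go 1.20 and later, to my knowledge). Both are cumulative estimates since process start, so this yields a lifetime average rather than an instantaneous reading. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// gcCPUFraction returns the share of available CPU time spent in the GC.
func gcCPUFraction(gcSeconds, totalSeconds float64) float64 {
	if totalSeconds <= 0 {
		return 0
	}
	return gcSeconds / totalSeconds
}

// readFloat64 returns the named metric, or false if this Go version lacks it.
func readFloat64(name string) (float64, bool) {
	s := []metrics.Sample{{Name: name}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindFloat64 {
		return 0, false
	}
	return s[0].Value.Float64(), true
}

func main() {
	gc, okGC := readFloat64("/cpu/classes/gc/total:cpu-seconds")
	total, okTotal := readFloat64("/cpu/classes/total:cpu-seconds")
	if !okGC || !okTotal {
		fmt.Println("CPU class metrics are not available in this Go version")
		return
	}
	fmt.Printf("GC share of available CPU since start: %.1f%%\n", gcCPUFraction(gc, total)*100)
}
```

Sampling twice and dividing the deltas would give the share over a chosen window, which is closer to what a dashboard shows.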

Also, is the memory limit per goroutine? If so, how do we set it for the whole program?

It's for the whole Go process.

In the entry point of our application we have specified something like this:

debug.SetMemoryLimit(int64(8 * 1024 * 1024 * 1024))

That should work fine, but just so we're on the same page, that will set an 8 GiB memory limit. Note that the GC may execute very frequently (but again, still capped at roughly 50%) if this value is set smaller than the baseline memory use your program requires.

Also, where do we set the optional unit described in the documentation?

The optional unit is part of the GOMEMLIMIT environment variable that Go programs understand, e.g. GOMEMLIMIT=18GiB.

akshayjshah added a commit to connectrpc/connect-go that referenced this issue Jul 26, 2023
The more I look at it, the more convinced I am that this option is a bad
idea. It's very unclear what it's trying to accomplish, and there are
many better options:

* Limiting heap usage? Use the upcoming soft memory limit APIs
  (golang/go#48409).
* Limiting network I/O? Use `http.MaxBytesReader` and set a per-stream
  limit.
* Banning "large messages"? Be clear what you mean, and use
  `unsafe.Sizeof` or `proto.Size` in an interceptor.

Basically, the behavior here (and in grpc-go) is an incoherent middle
ground between Go runtime settings, HTTP-level settings, and a vague "no
large messages" policy.

I'm doubly sure we should delete this because we've decided not to
expose the metrics to track how close users are to the configured limit
:)
@golang golang locked and limited conversation to collaborators Jul 18, 2024