gocollector: Reverted client_golang v1.12 addition of runtime/metrics metrics by default #1033
Conversation
… metrics by default. Fixes #967 Signed-off-by: Bartlomiej Plotka <[email protected]>
I drop most of these new metrics; all the points listed above apply to me. There are a lot of them, and I either don't know how to use them or they don't tell me much (because they're so low level). Adding drop rules to all scrapes is a pain, and it makes me wish there was a global drop config that applies to all targets.
If most of the additional cardinality is from histograms, can we just disable those? Once you eliminate those, the cardinality should be just about what it was before.
I'm curious to know which ones you find to be too low level. The majority of the non-histogram metrics are just the same metrics as before.

Coming down the pipeline in the Go world is golang/go#48409, which is going to expose new, useful metrics for identifying when the GC is being throttled to avoid death spirals. This is useful for diagnosing OOMs and linking them back to a bad configuration push or a sudden increase in memory use (or spikier memory use). It does not scale for the Go team to go to each metrics exporter (Prometheus, etc.) and add these metrics. That's a big motivation behind slurping up all metrics programmatically. I worry that new metrics will only be added after something like this has already impacted someone, which is what I wanted to avoid in the first place.

From my perspective, a lot of the issues that cropped up over time are the result of my initial flawed integration of runtime/metrics into Prometheus. The flaws, I think, were mostly due to my inexperience with the Prometheus ecosystem generally (going back to the poor choice of histogram representation). As a result, turning off exporting the new metrics by default completely, in addition to switching back to the old metrics, feels like throwing the baby out with the bathwater.

That being said, I recognize that this integration of runtime/metrics has been causing you (the maintainers) problems, and this is a straightforward way to resolve that. I apologize for not being fully present to address these issues. I also recognize that since this has already been merged, my point here is somewhat moot. I do hope you'll reconsider the decision to not export the new metrics by default (or at least the non-histogram metrics) in the future.
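For context on "slurping up all metrics programmatically": the standard library's runtime/metrics package lets a collector discover and read every runtime metric without hard-coding names. A minimal standalone sketch using only the standard library (this is not client_golang's actual collector code):

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Discover every metric the Go runtime currently exports.
	descs := metrics.All()

	// Allocate one sample slot per metric and read them all in one call.
	samples := make([]metrics.Sample, len(descs))
	for i, d := range descs {
		samples[i].Name = d.Name
	}
	metrics.Read(samples)

	for _, s := range samples {
		switch s.Value.Kind() {
		case metrics.KindUint64:
			fmt.Printf("%s = %d\n", s.Name, s.Value.Uint64())
		case metrics.KindFloat64:
			fmt.Printf("%s = %g\n", s.Name, s.Value.Float64())
		case metrics.KindFloat64Histogram:
			h := s.Value.Float64Histogram()
			fmt.Printf("%s = histogram with %d buckets\n", s.Name, len(h.Counts))
		}
	}
}
```

Because the names come from metrics.All() at runtime, new runtime metrics show up automatically as Go releases add them, which is the scaling argument made above.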
Things like:
Just not sure what to do with these, really. In most cases I just need CPU/memory usage (including GC stats). If resource usage is too high, I usually look at pprofs to see what's going on.
@mknyszek Foremost, we really appreciate your contribution and any help from the Go team. I believe we just need time to polish the implementation and make sure it satisfies most users' needs, and I hope you'll help us reach that goal. We haven't reverted the feature or anything; we just made it optional, as the Go collector itself is optional.

Besides the minor inconvenience that we need to deal with the histogram problem (for which we will have a solution soon thanks to sparse histograms, as @beorn7, their author, has mentioned a couple of times), the cardinality is a huge issue to deal with, one that has direct cost consequences out in the wild. On top of that, we need to make an extra effort to help our user base deal with these new metrics. We can provide meaningful alerts and example dashboards so that they get immediate value out of them.

So all in all this is just a tiny setback; we'll get more prepared, and then we can reconsider making the new collector the default. Until then, for the users who want to enable it, the option is there. I believe we can improve the documentation and be more transparent about the pros and cons of enabling it, and users can choose whether they want to opt in.
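To illustrate what "the option is there" looks like for users: a minimal sketch of opting back in to the runtime/metrics-based metrics via the Go collector options added around #1031. The option and flag names below are the ones from that change and may differ or be deprecated in other client_golang versions, so treat them as an assumption to verify against your version's docs.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
)

func main() {
	reg := prometheus.NewRegistry()

	// Opt in to the new runtime/metrics-based metrics on top of the classic
	// MemStats-based ones. Option/flag names are assumed from #1031-era code.
	reg.MustRegister(collectors.NewGoCollector(
		collectors.WithGoCollections(collectors.GoRuntimeMemStatsCollection | collectors.GoRuntimeMetricsCollection),
	))

	// Gather once to list which metric families the opt-in produces.
	mfs, err := reg.Gather()
	if err != nil {
		panic(err)
	}
	for _, mf := range mfs {
		fmt.Println(mf.GetName())
	}
}
```

Registering `collectors.NewGoCollector()` with no options keeps the pre-1.12 default behaviour that this PR restores.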
@kakkoyun nailed explaining it. I think we essentially need a more controlled introduction of new metrics. There are literally books written about which metrics you should use for monitoring Go; we can't change that overnight, even if it's "only" naming that changes. Naming is what tools depend on, not only humans, even if the new names are clearer. By implicitly adding new metrics on top of the old ones, we are not helping adoption much, because in most cases people won't notice the new ones (there is no good way of discovering them) other than by having to pay more for them. We need to work around this with documentation, tutorials and tooling changes.

NOTE: It's also not only the histograms that are problematic. The overlap of extra metrics that exist only to rename old ones and allow people to migrate to them is significant. Again, without the extra work mentioned above, it will not be useful for others.
Yes - and it would be awesome to have those! For now as opt-in. We might at some point consider more granular options, or even a slightly bigger default set of truly popular and useful metrics for everyone, in the future.
It does not scale for the Go Team, but:
Don't worry, it's our fault (mine and @kakkoyun's) that we completely missed those shortcomings. And I truly believe we can fix this (and we will, before releasing the new version) within days; it's not a biggie. Still, even with fixes, we are adding too many new metrics without value (the majority of people/tooling won't know about them and won't know how to use them).
Again, no need to be sorry @mknyszek, your help was life-saving here. It's ultra-important to work with the Go Team on what's coming to the Go runtime in terms of instrumentation 💪🏽 It was amazing to have your contributions, so let's move those things forward, just a little bit more safely and as opt-in, until we graduate some things to default! (:
Definitely!
Renaming metrics and just trying to stay on top of that is a really huge pain point for my org. It always takes us ages to upgrade node_exporter because we need to triple-check that all metrics that got renamed were accounted for in the process. We had alerts broken for months because the path from "version of node_exporter on some server" to "we alert when bad things happen" is a long piece of string that looks like one of those crazy cartoon contraptions: you need to scrape the metrics, then we often have recording rules to aggregate them, then we have federation that pulls aggregates, then we have more aggregation to build global aggregates, and then we have alerts. At every step we need to ensure that names are correct, that we filter on the correct labels, and that we're scraping the correct servers and services. Since you don't get any errors from Prometheus when metrics aren't there (because it's perfectly fine to ask for something that doesn't exist), it's all a very fragile ecosystem that requires care and attention. This makes renaming expensive. If you also make changes to labels (add or remove some), it's extra expensive.
Thanks for the replies and the additional context. I ultimately don't disagree with your decision and I certainly don't hold it against you. You do what you need to do for yourselves and the project. I'm also glad it sounds like we're on the same page about where we might want this to be in the future. :) It's clear to me (and this sounds like what you're ultimately getting at) that rolling this out needs a much more fully fleshed out strategy. I'll be thinking about that going forward, for Prometheus and beyond. Your feedback is invaluable. A few replies:
… metrics by default. (#1033) Fixes #967 Signed-off-by: Bartlomiej Plotka <[email protected]>
* Cut v1.12.0
  Signed-off-by: Kemal Akkoyun <[email protected]>

* Bump the day
  Signed-off-by: Kemal Akkoyun <[email protected]>

* Make the Go 1.17 collector thread-safe (#969)

* Use simpler locking in the Go 1.17 collector (#975)

  A previous PR made it so that the Go 1.17 collector locked only around uses of rmSampleBuf, but really that means that Metric values may be sent over the channel containing some values from future metrics.Read calls. While generally speaking this isn't a problem, we lose any consistency guarantees provided by the runtime/metrics package. Also, that optimization to not just lock around all of Collect was premature. Truthfully, Collect is called relatively infrequently, and its critical path is fairly fast (10s of µs). To prove it, this change also adds a benchmark.

    name            old time/op  new time/op  delta
    GoCollector-16  43.7µs ± 2%  43.2µs ± 2%  ~  (p=0.190 n=9+9)

  Note that because the benchmark is single-threaded it actually looks like it might be getting *slightly* faster, because all those Collect calls for the Metrics are direct calls instead of interface calls.

  Signed-off-by: Michael Anthony Knyszek <[email protected]>

* API client: make http reads more efficient (#976)

  Replace `io.ReadAll` with `bytes.Buffer.ReadFrom`. Both need to resize a buffer until they have finished reading; the former increases by 1.25x each time while the latter uses 2x. Also added a benchmark to demonstrate the benefit:

    name             old time/op    new time/op    delta
    Client/4KB-8     35.9µs ± 4%    35.3µs ± 3%       ~     (p=0.310 n=5+5)
    Client/50KB-8    83.1µs ± 8%    69.5µs ± 1%    -16.37%  (p=0.008 n=5+5)
    Client/1000KB-8   891µs ± 6%     750µs ± 0%    -15.83%  (p=0.016 n=5+4)
    Client/2000KB-8  1.74ms ± 2%    1.35ms ± 1%    -22.72%  (p=0.008 n=5+5)

    name             old alloc/op   new alloc/op   delta
    Client/4KB-8     20.2kB ± 0%    20.4kB ± 0%     +1.26%  (p=0.008 n=5+5)
    Client/50KB-8     218kB ± 0%     136kB ± 0%    -37.65%  (p=0.008 n=5+5)
    Client/1000KB-8  5.88MB ± 0%    2.11MB ± 0%    -64.10%  (p=0.008 n=5+5)
    Client/2000KB-8  11.7MB ± 0%     4.2MB ± 0%    -63.93%  (p=0.008 n=5+5)

    name             old allocs/op  new allocs/op  delta
    Client/4KB-8       75.0 ± 0%      72.0 ± 0%     -4.00%  (p=0.008 n=5+5)
    Client/50KB-8       109 ± 0%        98 ± 0%    -10.09%  (p=0.079 n=4+5)
    Client/1000KB-8     617 ± 0%       593 ± 0%     -3.89%  (p=0.008 n=5+5)
    Client/2000KB-8   1.13k ± 0%     1.09k ± 0%     -3.27%  (p=0.008 n=5+5)

  Signed-off-by: Bryan Boreham <[email protected]>

* Reduce granularity of histogram buckets for Go 1.17 collector (#974)

  The Go runtime/metrics package currently exports extremely granular histograms. Exponentially bucket any histogram with unit "seconds" or "bytes" instead to dramatically reduce the number of buckets, and thus the number of metrics. This change also adds a test to check for expected cardinality to prevent cardinality surprises in the future.

  Signed-off-by: Michael Anthony Knyszek <[email protected]>

* Cut v1.12.1 (#978)
  * Cut v1.12.1
    Signed-off-by: Kemal Akkoyun <[email protected]>
  * Apply review suggestions
    Signed-off-by: Kemal Akkoyun <[email protected]>

* Fix deprecated `NewBuildInfoCollector` API

  Update `examples/random/main.go`: `prometheus.NewBuildInfoCollector` is deprecated. Use `collectors.NewBuildInfoCollector` instead.

  Signed-off-by: alissa-tung <[email protected]>

* gocollector: Added options to Go Collector for changing the (#1031)
  * Renamed files.
    Signed-off-by: Bartlomiej Plotka <[email protected]>
  * gocollector: Added options to Go Collector for different collections.

    Fixes #983

    Also:
    * fixed TestMemStatsEquivalence, it was a noop before (:
    * removed the gc_cpu_fraction metric completely, since it's not working completely for Go 1.17+

    Signed-off-by: Bartlomiej Plotka <[email protected]>

* gocollector: Reverted client_golang v1.12 addition of runtime/metrics metrics by default. (#1033)

  Fixes #967

  Signed-off-by: Bartlomiej Plotka <[email protected]>

* prometheus: Fix convention violating names for generated collector metrics (#1048)
  * Fix convention violating names for generated collector metrics
    Signed-off-by: Kemal Akkoyun <[email protected]>
  * Add new Go collector example
    Signed-off-by: Kemal Akkoyun <[email protected]>

* Remove -Inf buckets from go collector histograms (#1049)
  * Remove -Inf buckets from go collector histograms
    Signed-off-by: Kemal Akkoyun <[email protected]>
  * Update prometheus/collectors/go_collector_latest_test.go
    Co-authored-by: Bartlomiej Plotka <[email protected]>
    Signed-off-by: Kemal Akkoyun <[email protected]>
  * Simplify
    Signed-off-by: Kemal Akkoyun <[email protected]>
  Co-authored-by: Bartlomiej Plotka <[email protected]>

* Cut v1.12.2 (#1052)
  * Cut v1.12.2
    Signed-off-by: Kemal Akkoyun <[email protected]>
  * Apply suggestions
    Signed-off-by: Kemal Akkoyun <[email protected]>
  * Update CHANGELOG.md
    Co-authored-by: Bartlomiej Plotka <[email protected]>
    Signed-off-by: Kemal Akkoyun <[email protected]>
  Co-authored-by: Bartlomiej Plotka <[email protected]>

Co-authored-by: Kemal Akkoyun <[email protected]>
Co-authored-by: Michael Knyszek <[email protected]>
Co-authored-by: Bryan Boreham <[email protected]>
Co-authored-by: Kemal Akkoyun <[email protected]>
Co-authored-by: alissa-tung <[email protected]>
Co-authored-by: Bartlomiej Plotka <[email protected]>
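To make the #976 entry above concrete: that change replaces io.ReadAll with reads into a bytes.Buffer. A rough standalone illustration of the pattern (not the actual API client code), using an in-memory reader as a stand-in for the HTTP response body:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

func main() {
	// Stand-in for an HTTP response body; the real change is in the API client.
	payload := strings.Repeat("x", 1<<20)

	// Before #976: io.ReadAll grows its buffer in smaller increments.
	before, err := io.ReadAll(strings.NewReader(payload))
	if err != nil {
		panic(err)
	}

	// After #976: read into a bytes.Buffer, which grows its capacity more
	// aggressively, so large bodies need fewer reallocations.
	var buf bytes.Buffer
	if _, err := buf.ReadFrom(strings.NewReader(payload)); err != nil {
		panic(err)
	}

	fmt.Println(len(before), buf.Len()) // both hold the complete body
}
```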
Fixes #967
I am not 100% sure we want this, but I would like to start this discussion.
On one hand, with @mknyszek we wanted new runtime/metrics naming and extra metrics to be exposed by default, so people can leverage it.
On the other hand:
Invalid le label value on go_gc_pauses_seconds_total_bucket #995
New GoCollector metrics in 1.12 do not pass promtool check metrics lint #1026

Let me know what you think. I would revert the default to the old behaviour, IMO, to be safe here and save people from the accidental cost impact it can cause.
Signed-off-by: Bartlomiej Plotka [email protected]
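For anyone assessing the impact of this revert on their own binaries, a minimal sketch (port and handler layout are arbitrary choices, not from this PR) that registers the default Go collector and serves /metrics, so you can confirm that only the pre-1.12 go_* metric names are exposed; opting back in to the new runtime/metrics-based metrics requires the explicit collector option discussed earlier in the thread.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()

	// With this revert merged, the default Go collector exposes only the
	// classic MemStats-derived go_* metrics, as it did before v1.12.
	reg.MustRegister(collectors.NewGoCollector())

	// Scrape http://localhost:2112/metrics to inspect the exposed names.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```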