Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gocollector: Reverted client_golang v1.12 addition of runtime/metrics metrics by default #1033

Merged
merged 1 commit into from
Apr 13, 2022

Conversation

bwplotka
Copy link
Member

@bwplotka bwplotka commented Apr 13, 2022

Fixes #967

I am not 100% sure we want this, but I would like to start this discussion.

On one hand, with @mknyszek we wanted new runtime/metrics naming and extra metrics to be exposed by default, so people can leverage it.

On the other hand:

Let me know what you think. I would revert the default to old behaviour IMO to be safe here and save ppl from the accidental cost impact it can cause.

Signed-off-by: Bartlomiej Plotka [email protected]

… metrics by default.

Fixes #967

Signed-off-by: Bartlomiej Plotka <[email protected]>
@bwplotka bwplotka changed the title gocollector: Reverted client_golang v1.12 addition of runtime/metrics… gocollector: Reverted client_golang v1.12 addition of runtime/metrics metrics by default Apr 13, 2022
@bwplotka bwplotka requested review from beorn7 and kakkoyun April 13, 2022 18:00
@prymitive
Copy link
Contributor

I drop most of these new metrics, all the points listed above do apply to me - there’s a lot of them and I either don’t know how to use them or they don’t tell me much (because they’re so low level).

Adding drop rules to all scrapes is a pain, makes me wish there was a global drop config that applies to all targets.

@kakkoyun kakkoyun merged commit 11ee9ad into main Apr 13, 2022
@kakkoyun kakkoyun deleted the revert-runtime-metrics-default branch April 13, 2022 18:43
@mknyszek
Copy link
Contributor

If most of the additional cardinality is from histograms, can we just disable those? Once you eliminate those, the cardinality should be just about what it was before.

there’s a lot of them and I either don’t know how to use them or they don’t tell me much (because they’re so low level).

I'm curious to know which you find to be too low level. The majority of the non-histogram metrics are just the same metrics from MemStats with slightly different names (IMO more clear, because for example all the memory metrics are now orthogonal to one another). Lots of these are very general as well: reasonably granular memory accounting plus things that come in handy when you least expect (e.g. GC cycle rates and total memory allocated can help diagnose lots of issues after the fact, especially in long-term trends).

Coming down the pipeline in the Go world is golang/go#48409 which is going to expose new, useful metrics for identifying when the GC is being throttled to avoid death spirals. This is useful for diagnosing OOMs, linking them back to a bad configuration push or a sudden increase in memory use (or spikier memory use).

It does not scale for the Go team to go to each metrics exporter (Prometheus, etc.) and add these metrics. That's a big motivation behind slurping up all metrics programmatically. I worry that new metrics will only be added after something like this has already impacted someone, which is what I wanted to avoid in the first place.

From my perspective, a lot of the issues that cropped up over time are the result of my initial flawed integration of runtime/metrics into Prometheus. The flaws, I think, were mostly due to my inexperience with the Prometheus ecosystem generally (going back to the poor choice of histogram representation). As a result, turning off exporting new metrics by default completely in addition to switching to the old metrics feels like throwing the baby out with the bathwater.

That being said, I recognize that this integration of runtime/metrics has been causing you (the maintainers) problems, and this is a straightforward way to resolve that. I apologize for not being fully present to address these issues.

I also recognize that since this has already been merged my point here is somewhat moot. I do hope you'll reconsider the decision to not export new metrics by default (or at least the non-histogram metrics) in the future.

@prymitive
Copy link
Contributor

I'm curious to know which you find to be too low level.

Things like:

  • go_gc_heap_allocs_by_size_bytes_total_bucket
  • go_gc_heap_frees_by_size_bytes_total_bucket
  • go_sched_latencies_seconds_bucket
  • some go_memory_classes_* or go_memstats_*

Just not sure what to do with these really. In most cases I just need cpu/memory usage (including GC stats). If resource usage is too high I usually look at pprofs to see what's going on.

@kakkoyun
Copy link
Member

@mknyszek Foremost, we really appreciate your contribution and any help from the Go team. I believe we just need time to polish the implementation and make sure it satisfies most of the users' needs. And I hope you'll help us to reach that goal.

We haven't reverted the feature or anything, we just made it optional as the Go collector itself optional.

Besides minor inconvenience that we need to deal with the histogram problem, which we will have a solution soon thanks to sparse histograms (@beorn7 as the author, mentioned it a couple of times). The cardinality is a huge issue to deal with, that has direct cost consequences, out in the wild.

On top of that, we need to make another extra effort to help our user base to how to deal with these new metrics. We can provide meaningful alerts and example dashboards so that they can have immediate value out of them.

So all in all this is just a tiny bit of set back, we'll get more prepared, and then we can reconsider making the new collector default. Until then, for the users who want to enable it, the option is there. I believe we can improve the documentation and be more transparent about the pros and cons of enabling it, and users can choose if they want to opt in.

@bwplotka
Copy link
Member Author

@kakkoyun nailed explaining it. I think we need to essentially take a more controllable introduction of new metrics. There are literally books written about what metrics you should use for monitoring Go. We can't change it overnight, even if it's "only" naming that changes. Naming is what tools depend on, not only humans, even if the name is clearer now. By implicitly adding new ones on top of old ones we are not helping with adoption much, because people in most cases won't notice new ones (no good way of discovering them) - other than having to pay more for those metrics. We need to work around this with documentation, tutorials and tooling changes.

NOTE: It's also not only histograms that are problematic. An overlap of extra metrics just to rename them and allow ppl to migrate to them is significant. Again, without the extra work mentioned above, it will not be useful for others.

Coming down the pipeline in the Go world is golang/go#48409 which is going to expose new, useful metrics for identifying when the GC is being throttled to avoid death spirals. This is useful for diagnosing OOMs, linking them back to a bad configuration push or a sudden increase in memory use (or spikier memory use).

Yes - and it would be awesome to have those! For now as opt-in. We might at some point consider more granular options or even a little bit bigger default for truly popular and useful metrics for everyone in future.

It does not scale for the Go team to go to each metrics exporter (Prometheus, etc.) and add these metrics. That's a big motivation behind slurping up all metrics programmatically. I worry that new metrics will only be added after something like this has already impacted someone, which is what I wanted to avoid in the first place.

It does not scale for Go Team, but:

  • This situation proved that it's so easy to make the unexpected impactful situation for users as we should probably treat metric change much more seriously than programmatic API change in compatibility terms. Go community that works on those metrics might treat it lightly and rename or add an enormous histogram because it's important for their use cases. Not many will remember that this has serious consequences to those exporters like client_golang between 1.12-now that immediately transfer all changes of runtime/metrics to thousand production clusters without any review of metric experts from exporter domain.
  • You are not alone. If a metric will be useful it will be added as soon as PR to Go is merged and the version with the new runtime/metric change is released. It's nice to have those things consumed on more data driven "need" base, and not because a developer thinks it's useful.

From my perspective, a lot of the issues that cropped up over time are the result of my initial flawed integration of runtime/metrics into Prometheus. The flaws, I think, were mostly due to my inexperience with the Prometheus ecosystem generally (going back to the poor choice of histogram representation). As a result, turning off exporting new metrics by default completely in addition to switching to the old metrics feels like throwing the baby out with the bathwater.

Don't worry, it's our fault (my and @kakkoyun) that we completely missed those shortcomings. And I truly believe we could fix this (and we will, before releasing of new version) within days, it's not a biggie. Still, even with fixes, we are adding too many new metrics without value (the majority of people/tooling won't know about them and won't know how to use them).

That being said, I recognize that this integration of runtime/metrics has been causing you (the maintainers) problems, and this is a straightforward way to resolve that. I apologize for not being fully present to address these issues.

Again, no need to be sorry @mknyszek your help was life-saving here. It's ultra-important to work with Go Team on what's coming to Go runtime in terms of instrumentation 💪🏽 It's was amazing to have your contributions and let's move those things forward, just a little bit more safely and opt-in - until we graduate some things to default! (:

I do hope you'll reconsider the decision to not export new metrics by default (or at least the non-histogram metrics) in the future.

Defnitely!

@prymitive
Copy link
Contributor

NOTE: It's also not only histograms that are problematic. An overlap of extra metrics just to rename them and allow ppl to migrate to them is significant. Again, without the extra work mentioned above, it will not be useful for others.

Renaming metrics and just trying to stay on top of that is really a huge pain point for my org. It always takes us ages to upgrade node_exporter because we need to tripple-check that all metrics that got renamed were accounted for in the process. We had alerts broken for months because the path from "version of node_exporter on some server" to "we alert when bad things happen" is a long piece of string that looks like one of those crazy cartoon contraptions - you need to scrape it, then we often have recording rules to aggregate them, then we have federation that pulls aggregates, then we have more aggregation to build global aggregates, then we have alerts. On every step we need to ensure that names are correct, that we filter on correct labels and that we're scraping correct servers and services. Since you don't get any errors from Prometheus when metrics aren't there (because it's perfectly fine to ask for something that doesn't exist) it's all a very fragile ecosystem that requires care and attention. This makes renaming expensive. If you also make changes to labels (add or remove some) it's extra expensive.
This is such a big problem for us that we've built and run a sidecar on every Prometheus server that keep testing all Prometheus rules to tell us if any metrics used in queries are missing from Prometheus - github.com/cloudflare/pint - just so that we can upgrade node_exporter (or any other exporter) and if that breaks any alerts we'll have a chance of noticing. It still leaves us with lots of Grafana dashboards to update, but that's less critical.

@mknyszek
Copy link
Contributor

Thanks for the replies and the additional context. I ultimately don't disagree with your decision and I certainly don't hold it against you. You do what you need to do for yourselves and the project. I'm also glad it sounds like we're on the same page about where we might want this to be in the future. :) It's clear to me (and this sounds like what you're ultimately getting at) that rolling this out needs a much more fully fleshed out strategy. I'll be thinking about that going forward, for Prometheus and beyond. Your feedback is invaluable.

A few replies:

  • @prymitive I recognize the name changes are intrusive, but it was really supposed to be a one-time thing. Things are really bad if we're making even one name change per release, without notifying ahead of time. It's supposed to be purely additive, at a slow rate. But clearly even the one-time thing is too intrusive for now.
  • @bwplotka @kakkoyun I didn't fully appreciate earlier how baked in the existing metric names are into everything. Looks like there's a lot of work to do in fully making the switch to runtime/metrics from MemStats and GCStats in general. To be clear, I didn't expect everyone to suddenly start using runtime/metrics, but I underestimated how far these concepts push up beyond systems like Prometheus, and how widespread they are.
  • @bwplotka RE: the metric overlap, earlier on in the implementation I was looking for something like an alias mechanism that could be used for migrations, akin to Go's type alias feature. Not sure if anything like that is on the roadmap for Prometheus at all, but it seems like it could save a lot of headache in even changes in application-specific metrics.
  • @prymitive RE: metrics being too low-level, the ones you listed surprised me. It sounds like the Go project needs to do a better job documenting and publicizing things. Part of my plan for runtime/debug: soft memory limit golang/go#48409 is to write some better GC documentation, geared toward users trying to understand a Go application, so that should help a little.
  • A few of you have also mentioned that there isn't enough benefit to the renaming and overlapping. I can accept given the current state of things. For some additional context, to me, there was always a clear long-term benefit for users: when something goes wrong, or you need to investigate something, you'll never have to find yourself wishing you weren't tracking some metric. This is the same principle behind why memory profiling is always on in every Go binary, and needs to be turned off explicitly. In terms of switching to runtime/metrics, there's a short-term cost for the switch to this new world (to some degree, this is unavoidable), and a long-term cost in carrying metrics that you might not need all the time. But, this cost is outweighed by the engineering costs of needing to redeploy and reproduce issues: it's always right there.

kakkoyun pushed a commit that referenced this pull request May 13, 2022
… metrics by default. (#1033)

Fixes #967

Signed-off-by: Bartlomiej Plotka <[email protected]>
kakkoyun added a commit that referenced this pull request Jul 6, 2022
* Cut v1.12.0

Signed-off-by: Kemal Akkoyun <[email protected]>

* Bump the day

Signed-off-by: Kemal Akkoyun <[email protected]>

* Make the Go 1.17 collector thread-safe (#969)

* Use simpler locking in the Go 1.17 collector (#975)

A previous PR made it so that the Go 1.17 collector locked only around
uses of rmSampleBuf, but really that means that Metric values may be
sent over the channel containing some values from future metrics.Read
calls. While generally-speaking this isn't a problem, we lose any
consistency guarantees provided by the runtime/metrics package.

Also, that optimization to not just lock around all of Collect was
premature. Truthfully, Collect is called relatively infrequently, and
its critical path is fairly fast (10s of µs). To prove it, this change
also adds a benchmark.

name            old time/op  new time/op  delta
GoCollector-16  43.7µs ± 2%  43.2µs ± 2%   ~     (p=0.190 n=9+9)

Note that because the benchmark is single-threaded it actually looks
like it might be getting *slightly* faster, because all those Collect
calls for the Metrics are direct calls instead of interface calls.

Signed-off-by: Michael Anthony Knyszek <[email protected]>

* API client: make http reads more efficient (#976)

Replace `io.ReadAll` with `bytes.Buffer.ReadFrom`.
Both need to resize a buffer until they have finished reading;
the former increases by 1.25x each time while the latter uses 2x.

Also added a benchmark to demonstrate the benefit:
name             old time/op    new time/op    delta
Client/4KB-8       35.9µs ± 4%    35.3µs ± 3%     ~     (p=0.310 n=5+5)
Client/50KB-8      83.1µs ± 8%    69.5µs ± 1%  -16.37%  (p=0.008 n=5+5)
Client/1000KB-8     891µs ± 6%     750µs ± 0%  -15.83%  (p=0.016 n=5+4)
Client/2000KB-8    1.74ms ± 2%    1.35ms ± 1%  -22.72%  (p=0.008 n=5+5)

name             old alloc/op   new alloc/op   delta
Client/4KB-8       20.2kB ± 0%    20.4kB ± 0%   +1.26%  (p=0.008 n=5+5)
Client/50KB-8       218kB ± 0%     136kB ± 0%  -37.65%  (p=0.008 n=5+5)
Client/1000KB-8    5.88MB ± 0%    2.11MB ± 0%  -64.10%  (p=0.008 n=5+5)
Client/2000KB-8    11.7MB ± 0%     4.2MB ± 0%  -63.93%  (p=0.008 n=5+5)

name             old allocs/op  new allocs/op  delta
Client/4KB-8         75.0 ± 0%      72.0 ± 0%   -4.00%  (p=0.008 n=5+5)
Client/50KB-8         109 ± 0%        98 ± 0%  -10.09%  (p=0.079 n=4+5)
Client/1000KB-8       617 ± 0%       593 ± 0%   -3.89%  (p=0.008 n=5+5)
Client/2000KB-8     1.13k ± 0%     1.09k ± 0%   -3.27%  (p=0.008 n=5+5)

Signed-off-by: Bryan Boreham <[email protected]>

* Reduce granularity of histogram buckets for Go 1.17 collector (#974)

The Go runtime/metrics package currently exports extremely granular
histograms. Exponentially bucket any histogram with unit "seconds"
or "bytes" instead to dramatically reduce the number of buckets, and
thus the number of metrics.

This change also adds a test to check for expected cardinality to
prevent cardinality surprises in the future.

Signed-off-by: Michael Anthony Knyszek <[email protected]>

* Cut v1.12.1 (#978)

* Cut v1.12.1

Signed-off-by: Kemal Akkoyun <[email protected]>

* Apply review suggestions

Signed-off-by: Kemal Akkoyun <[email protected]>

* Fix deprecated `NewBuildInfoCollector` API

Update `examples/random/main.go`:
  `prometheus.NewBuildInfoCollector` is deprecated. Use `collectors.NewBuildInfoCollector` instead.

Signed-off-by: alissa-tung <[email protected]>

* gocollector: Added options to Go Collector for changing the (#1031)

* Renamed files.

Signed-off-by: Bartlomiej Plotka <[email protected]>

* gocollector: Added options to Go Collector for diffetent collections.

Fixes #983

Also:

* fixed TestMemStatsEquivalence, it was noop before (:
* Removed gc_cpu_fraction metric completely, since it's not working completely for Go1.17+

Signed-off-by: Bartlomiej Plotka <[email protected]>

* gocollector: Reverted client_golang v1.12 addition of runtime/metrics metrics by default. (#1033)

Fixes #967

Signed-off-by: Bartlomiej Plotka <[email protected]>

* prometheus: Fix convention violating names for generated collector metrics (#1048)

* Fix convention violating names for generated collector metrics

Signed-off-by: Kemal Akkoyun <[email protected]>

* Add new Go collector example

Signed-off-by: Kemal Akkoyun <[email protected]>

* Remove -Inf buckets from go collector histograms (#1049)

* Remove -Inf buckets from go collector histograms

Signed-off-by: Kemal Akkoyun <[email protected]>

* Update prometheus/collectors/go_collector_latest_test.go

Co-authored-by: Bartlomiej Plotka <[email protected]>
Signed-off-by: Kemal Akkoyun <[email protected]>

* Simplify

Signed-off-by: Kemal Akkoyun <[email protected]>

Co-authored-by: Bartlomiej Plotka <[email protected]>

* Cut v1.12.2 (#1052)

* Cut v1.12.2

Signed-off-by: Kemal Akkoyun <[email protected]>

* Apply suggestions

Signed-off-by: Kemal Akkoyun <[email protected]>

* Update CHANGELOG.md

Co-authored-by: Bartlomiej Plotka <[email protected]>
Signed-off-by: Kemal Akkoyun <[email protected]>

Co-authored-by: Bartlomiej Plotka <[email protected]>

Co-authored-by: Kemal Akkoyun <[email protected]>
Co-authored-by: Michael Knyszek <[email protected]>
Co-authored-by: Bryan Boreham <[email protected]>
Co-authored-by: Kemal Akkoyun <[email protected]>
Co-authored-by: alissa-tung <[email protected]>
Co-authored-by: Bartlomiej Plotka <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

v1.12.0 exposes some high cardinality metrics
5 participants