StatsD UDP publishing stops after failing to send lines #1591
Not sure if it's the same issue, but I have noticed that some of our applications using Micrometer with StatsD recently stopped publishing metrics at a seemingly random moment, but all at the same time. They seem to have recently been upgraded from Micrometer 1.1.5 to 1.1.6, as managed by Spring Boot 2.1.7.RELEASE and 2.1.8.RELEASE respectively. I will try downgrading to Micrometer 1.1.5 via Spring Boot's Maven property to see if this helps.
If I were to guess, the upgrade from reactor-netty 0.8.1 to 0.8.11 (see #1574 and #1561) is a likely candidate for the difference in behavior here. In previous patches, we never upgraded the shaded version of reactor-netty we were using. I will have to look into this more to confirm the root cause and what to do about this, though. Thank you for the thorough investigation @edwardsre and write-up of the issue. |
As a workaround, I've downgraded reactor-netty in #1613, which restores the previous behavior as far as my testing can tell.

Request to affected users

I'm leaving this issue open as we still want to identify the root cause and figure out where a fix needs to happen so we are not forever stuck on older versions.
In the cases I witnessed and heard about, the application had already been running for a couple of days. Does anyone have a quicker way to reproduce/test?
@shakuzen I heard from colleagues that they also encountered this with Micrometer 1.2.0. Is that possible and does |
@shakuzen I see Micrometer 1.1.7, 1.2.2 and 1.3.0 have been released. You say |
@shakuzen I have deployed one of our affected applications with the Thanks for leaving this open and continuing to look into the root cause in reactor-netty and netty. |
The workaround of downgrading the reactor-netty and netty dependencies is in the 1.1.7 and 1.2.2 releases. 1.3.0 upgrades to the recent reactor-netty 0.9.0 minor release, and there were some compatibility issues with downgrading that, so the issue is still present in 1.3 for now. Once the root cause of this issue is identified and fixed, we'll get out a patch release with the fix. @edwardsre Thank you for trying out the release and confirming things are back to working as expected so far.
Does not work for me with Micrometer 1.1.7 & Spring Boot 2.1.9. |
Spring Boot 2.2.0 (comes with Micrometer 1.3.0) also does not work. Could someone clarify what's the recommendation for Spring Boot users? |
Have you tried using Spring Boot 2.2.0.RELEASE and setting If not, I guess you need to stick with Spring Boot 2.1.x until this issue is fixed in Micrometer 1.3.x. |
I have investigated this more and the issue seems to be that the behavior of the reactor-netty UdpClient has changed:

before reactor-netty 0.8.7 / netty 4.1.36

since reactor-netty 0.8.7 / netty 4.1.36

Note that there should be no problem as long as the statsd daemon is started and accepting lines before the application starts sending statsd lines, and the application is able to continually send lines to the daemon. If, however, the statsd daemon is not accepting the lines sent by the application, the application will stop sending lines with the new versions of netty and reactor-netty.

I don't know if the change in behavior is intentional or not. Maybe it was caused by netty, and not something directly in reactor-netty. Perhaps @violetagg can speak to that.

In Micrometer, we could obviate the issue by implementing recovery logic for failing to send statsd lines via UDP, as was already requested for TCP in #1212. Otherwise we would need a way to get the previous behavior from the underlying reactor-netty UdpClient.
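To make the idea of "recovery logic for failing to send statsd lines via UDP" concrete, here is a minimal sketch using only the JDK's `DatagramSocket`. It is not Micrometer's or reactor-netty's implementation; the class name `ResilientUdpSender` and the method `sendAll` are hypothetical. The assumption is that a failed send should drop that one line and let later lines be attempted, rather than ending publishing altogether.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical helper for illustration; not part of Micrometer or reactor-netty. */
public class ResilientUdpSender {

    /**
     * Sends each line as one UDP datagram to the target.
     * A send that throws is skipped instead of ending the loop,
     * so later lines are still attempted.
     * Returns the number of sends that completed without an exception.
     */
    public static int sendAll(List<String> lines, InetSocketAddress target) throws IOException {
        int sent = 0;
        try (DatagramSocket socket = new DatagramSocket()) {
            // On a connected UDP socket, an ICMP "port unreachable" reply may
            // surface as an exception on a later send (platform dependent).
            socket.connect(target);
            for (String line : lines) {
                byte[] payload = line.getBytes(StandardCharsets.UTF_8);
                try {
                    socket.send(new DatagramPacket(payload, payload.length, target));
                    sent++;
                } catch (IOException e) {
                    // Recovery: drop this line and keep going with the next one.
                }
            }
        }
        return sent;
    }
}
```

With a daemon listening on the target port, every line should be counted as sent. With nothing listening, some sends may fail depending on the platform, but the loop still reaches the end of the list.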
@shakuzen Can you test 0.8.13 Reactor Netty + 4.1.43 Netty because we have changes there? |
@violetagg I have tried with 0.8.13 Reactor Netty + 4.1.43 Netty and it has the same behavior as mentioned above for "since reactor-netty 0.8.7 / netty 4.1.36". |
I have updated the title of the issue to reflect my understanding after my testing showed consistent behavior regardless of multicast. I will open a separate issue for Micrometer 1.3.1 to temporarily downgrade our shaded versions of reactor-netty / netty to 0.8.6 / 4.1.34 which have the previous behavior. I'll leave this issue open to track the long-term solution. |
Does it mean this issue has been fixed in version 1.3.1 by downgrading the netty packages? |
That is my expectation. It is more of a workaround than the long-term solution. I'm leaving this issue open until we have a long term solution that allows us to upgrade to the latest versions of dependencies. Let us know if this isn't fixed for you with 1.3.1. In my testing, it appeared to be fixed. |
Upgrades to the latest version of reactor-core and reactor-netty.

Uses a `DirectProcessor` instead of a `UnicastProcessor` that can only be subscribed to once. We need to be able to subscribe multiple times for reconnecting to a server with a new `TcpClient`. For multi-threaded publishing to the `Processor`, we use a `FluxSink` rather than calling `onNext` directly.

Starting and stopping the registry should now work as expected and is tested.

Unlike before, there is no buffering done in the `Processor`. If the UDP/TCP client or `lineSink` cannot keep up with the rate metrics are produced at, metric lines will be dropped rather than buffered indefinitely (potentially until the application runs out of memory). A mitigation when using the UDP/TCP client is to use the `BufferingFlux` which is enabled by default - it buffers metric lines in memory up to a configurable size/time before emitting to the UDP/TCP client. As a consequence, the queue size/capacity methods and metrics (`StatsdMetrics`) are no longer available.

Resolves #1212
Resolves #1591
Resolves #1676
Resolves #1741
Supersedes #1251

Co-authored-by: Johnny Lim <[email protected]>
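As a rough illustration of buffering lines "up to a configurable size" before emitting, the sketch below groups metric lines into newline-joined batches with a length cap. It is plain Java and not Micrometer's `BufferingFlux`; the class `LineBatcher`, the method `batch`, and the parameter `maxChars` are hypothetical, the time-based flush is left out, and length is counted in chars rather than bytes for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of size-bounded batching; not Micrometer's BufferingFlux. */
public class LineBatcher {

    /**
     * Joins lines with '\n' into batches no longer than maxChars.
     * A single line longer than maxChars is emitted as its own batch.
     */
    public static List<String> batch(List<String> lines, int maxChars) {
        List<String> batches = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            // The +1 accounts for the newline separator that would be added.
            boolean wouldOverflow = current.length() + 1 + line.length() > maxChars;
            if (current.length() > 0 && wouldOverflow) {
                // Emit the current batch and start a new one.
                batches.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append('\n');
            }
            current.append(line);
        }
        if (current.length() > 0) {
            batches.add(current.toString());
        }
        return batches;
    }
}
```

For example, `LineBatcher.batch(List.of("a:1|c", "b:2|c", "c:3|c"), 12)` puts the first two lines in one batch and the third in a second batch, since adding the third line would exceed 12 characters.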
We have updated several applications to Spring Boot 2.1.8 and metric publishing using the StatsD meter registry has stopped. The `actuator/metrics` endpoint shows metrics being collected as expected. We publish to `localhost` on port `8125` at a `1m` interval (all defaults).

The `actuator/configprops` endpoint shows the metrics publication is enabled for statsd.

Result summary of Boot and Micrometer versions

When running Micrometer `1.1.5` and `1.1.4`, the Spring Boot java process shows an established connection on an ephemeral port sending to port 8125.

    netstat -uv | grep 8125
    udp 0 0 localhost:53613 localhost:8125 ESTABLISHED

    sudo netstat -lanp | grep 53613
    udp 0 0 ::ffff:127.0.0.1:53613 ::ffff:127.0.0.1:8125 ESTABLISHED 3295/java

UDP Traffic being received with a capture of the UDP traffic using tcpdump.

Utilizing the same command when running the same application with `1.1.6`, the java process does not show any ephemeral port established and bound to 8125.

After changing the logging level to `DEBUG` for `io.micrometer.shaded.reactor`, I could see many `onNextDropped` debug messages output for each meter with its associated line data:

Since `Operators` uses multicast, that led me to look at `ifconfig` and find that multicast is not turned on for the loopback interface for our AWS EC2 instances, but it is for eth0.

ifconfig - No Multicast on loopback

I turned on multicast for the loopback and restarted the Spring Boot app and metrics are now being published.

Turn on multicast

When the loopback interface is configured without multicast, metrics are published when using Micrometer `1.1.5`. The `eth0` interface has multicast turned on, so I'm wondering if that interface was being selected in `1.1.5` and now it isn't in `1.1.6`. I haven't been able to narrow down what change in `1.1.6` caused this to stop working.
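
For anyone wanting to compare interfaces from inside the JVM rather than with `ifconfig`, the sketch below lists each network interface with its loopback and multicast flags using the JDK's `java.net.NetworkInterface`. The class name `MulticastCheck` and the method `describeInterfaces` are hypothetical, and this only reports what the JVM sees; it does not show which interface Micrometer or netty actually selects.

```java
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical diagnostic helper; reports interface flags as seen by the JVM. */
public class MulticastCheck {

    /** Returns one descriptive line per network interface. */
    public static List<String> describeInterfaces() throws SocketException {
        List<String> out = new ArrayList<>();
        Enumeration<NetworkInterface> all = NetworkInterface.getNetworkInterfaces();
        if (all == null) {
            // Some JDK versions return null when no interfaces are found.
            return out;
        }
        for (NetworkInterface nif : Collections.list(all)) {
            out.add(nif.getName()
                    + " up=" + nif.isUp()
                    + " loopback=" + nif.isLoopback()
                    + " multicast=" + nif.supportsMulticast());
        }
        return out;
    }
}
```

Calling `MulticastCheck.describeInterfaces().forEach(System.out::println)` on an affected host should show whether the loopback interface reports `multicast=false` while `eth0` reports `multicast=true`.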