Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics stop updating after k8s agent is redeployed. #64

Closed
tcrozier opened this issue Jul 24, 2018 · 13 comments
Closed

Metrics stop updating after k8s agent is redeployed. #64

tcrozier opened this issue Jul 24, 2018 · 13 comments

Comments

@tcrozier
Copy link

tcrozier commented Jul 24, 2018

I am running a .net core client using dogstatsd-csharp-client to send custom metrics to a V6 datadog agent . The client is running in a kubernetes pod talking to the agent which is running on the same node. When everything is deployed the metrics are transmitted and received correctly. However, if I redeploy the datadog agent the metrics are no longer received. I can however run the following command from the pod the client is running on and still get custom metrics.

  • echo -n "custom_metric:50|g|#shell" >/dev/udp/172.21.97.95/8125

If I redeploy the client afterwards everything starts to communicate correctly once again.

@brantburnett
Copy link

Based on symptoms I've seen, I feel like this is a problem with the C# client not reconnecting if the DataDog agent is recycled.

Deploying a new version of the agent causes the agent to go down and back up. After this, stats are no longer delivered from the application. However, manually delivered stats still work (via command line mentioned above). Additionally, recycling the application causes stats to start delivering again.

@brantburnett
Copy link

To clarify a bit further, we're sending custom metrics via the DogStatsD server included with the DataDog agent, following these steps revolving around "hostPort": https://www.datadoghq.com/blog/monitor-kubernetes-docker/#bind-the-dogstatsd-port-to-a-host-port

When we upgrade the DD agent in the K8S DaemonSet, it kills the pod on each node and replaces it with a new pod. This is when the problem occurs, and we have to recycle our application pods before custom metrics resume flowing.

I've attempted to reproduce this issue in plain Docker on my local machine without success. Which makes me suspect that it might be at least partially related to some detail of Kubernetes networking.

@brantburnett
Copy link

So I think I've tracked this down to the way Kubernetes (in this case on AWS with Calico CNI networking layer) is handling UDP connections.

  • When the DD Agent pod is launched as hostPort, Kubernetes registers a NAT rule in iptables on the host to forward port 8125 to the pod
  • When the client app sends a UDP packet to the DogStatsD agent via the host IP, a UDP session is tracked in NAT on the host, based on source and dest IP and port
  • You can see this session on the K8S host using sudo cat /proc/net/ip_conntrack | grep "dport=8125"
  • This session has a lifetime of approximately 30 seconds (seems to be the default)
  • If the DD Agent pod is killed and recreated, i.e. during DaemonSet upgrade or via kubectl delete pod, it acquires a new IP
  • Any further traffic from the same source IP and port continues to be routed to the old IP so long as the session stays active (packets < every 30 seconds), therefore never reaching the new pod

The dogstatsd-csharp-client maintains a single .Net Socket class for the lifetime of the application. Therefore, the source port is always constant, thus triggering this situation so long as you keep trying to send stats.

I was able to replicate this in my Kubernetes cluster using these steps on the command line:

  1. kubectl create ns test to create a test namespace
  2. kubectl annotate ns test scheduler.alpha.kubernetes.io/node-selector=test=test restricts all pods in this namespace to nodes with the "test=test" label
  3. kubectl label node ip-172-21-98-62.ec2.internal test=test to label a single node in my cluster (running DD agent) so we can test on just this node
  4. kubectl run -n test cmdline --rm -it --image=ubuntu --restart=Never to launch an instance of Ubuntu on that node
  5. apt-get update && apt-get install -y netcat to install netcat on the instance of Ubuntu
  6. send a stat every 3 seconds 100 times. Note the IP is the host node IP address.
    for i in `seq 1 100`; do echo "custom_metric:50|g|#shell\n" | nc -u -p 40001 -q 1 172.21.98.62 8125; sleep 2; done
  7. After you see stats arriving, kubectl delete pod dd-agent-abcdef to delete the DD agent on the node being used (in a separate terminal so that the loop above is still running)
  8. Notice that stats have stopped arriving
  9. To start stats arriving again, cancel the loop and restart but change the source port from 40001 to 40002, or wait at least 30 seconds before restarting the loop

I think the solution may be to occasionally dispose and recreate the Socket in the .NET SDK. However, that still wouldn't prevent the loss of some statistics. And since UDP sockets are stateless, I'm not sure how to detect the situation and fix it sooner. @crhairr suggested that it might be possible to use QoS to detect that a packet wasn't delivered, but I'm not sure about this idea's performance implications, difficulty, or compatibility with Kubernetes.

@truthbk
Copy link
Member

truthbk commented Aug 13, 2018

@brantburnett did you try killing the conntrack entry with conntrack -D -p udp -s <srcIP> -d <dstIP> --sport <origSrcPort> --dport 8125 (or similar)?

This might be a calico bug more than anything else.

@brantburnett
Copy link

@truthbk I did not. I would assume this would work, based on the symptoms, but it's not something I specifically tested. I'm traveling this week, so I doubt I'll have time to look at it, but I'll let you know once I try it.

@brantburnett
Copy link

@truthbk

I can confirm that the conntrack command line you provided does indeed correct the problem (once I apt-get update && apt-get install -y conntrack on the Kubernetes host node).

Unfortunately, I'm not seeing any error logs in Calico indicating a failure:

2018-08-15 22:32:24.460 [INFO][105] calc_graph.go 211: Local endpoint deleted id=WorkloadEndpoint(node=ip-172-21-98-62, orchestrator=k8s, workload=default.datadog-agent-s5rmm, name=eth0)
2018-08-15 22:32:24.460 [INFO][105] int_dataplane.go 574: Received *proto.WorkloadEndpointRemove update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"default.datadog-agent-s5rmm" endpoint_id:"eth0" >
2018-08-15 22:32:24.461 [INFO][105] int_dataplane.go 574: Received *proto.ActiveProfileRemove update from calculation graph msg=id:<name:"k8s_ns.default" >
2018-08-15 22:32:24.461 [INFO][105] table.go 422: Queing deletion of chain. chainName="cali-pri-k8s_ns.default" ipVersion=0x4 table="filter"
2018-08-15 22:32:24.461 [INFO][105] table.go 717: Invalidating dataplane cache ipVersion=0x4 reason="chain removal" table="filter"
2018-08-15 22:32:24.461 [INFO][105] table.go 422: Queing deletion of chain. chainName="cali-pro-k8s_ns.default" ipVersion=0x4 table="filter"
2018-08-15 22:32:24.461 [INFO][105] int_dataplane.go 690: Applying dataplane updates
2018-08-15 22:32:24.461 [INFO][105] endpoint_mgr.go 465: Workload removed, deleting its chains. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default.datadog-agent-s5rmm", EndpointId:"eth0"}
2018-08-15 22:32:24.461 [INFO][105] table.go 422: Queing deletion of chain. chainName="cali-tw-calice46a2b65f1" ipVersion=0x4 table="filter"
2018-08-15 22:32:24.461 [INFO][105] table.go 422: Queing deletion of chain. chainName="cali-fw-calice46a2b65f1" ipVersion=0x4 table="filter"
2018-08-15 22:32:24.461 [INFO][105] endpoint_mgr.go 470: Workload removed, deleting old state. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default.datadog-agent-s5rmm", EndpointId:"eth0"}
2018-08-15 22:32:24.461 [INFO][105] table.go 398: Queueing update of chain. chainName="cali-from-wl-dispatch" ipVersion=0x4 table="filter"
2018-08-15 22:32:24.461 [INFO][105] table.go 398: Queueing update of chain. chainName="cali-to-wl-dispatch" ipVersion=0x4 table="filter"
2018-08-15 22:32:24.461 [INFO][105] endpoint_mgr.go 331: Re-evaluated workload endpoint status adminUp=false failed=false known=false operUp=false status="" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default.datadog-agent-s5rmm", EndpointId:"eth0"}
2018-08-15 22:32:24.461 [INFO][105] status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default.datadog-agent-s5rmm", EndpointId:"eth0"}
2018-08-15 22:32:24.461 [INFO][105] route_table.go 447: Syncing routes: removing old route. dest=100.119.52.73/32 ifaceName="calice46a2b65f1" ipVersion=0x4
2018-08-15 22:32:24.461 [INFO][105] table.go 438: Loading current iptables state and checking it is correct. ipVersion=0x4 table="filter"
2018-08-15 22:32:24.461 [INFO][105] conntrack.go 78: Removing conntrack flows ip=100.119.52.73
2018-08-15 22:32:24.470 [INFO][105] status_combiner.go 86: Reporting endpoint removed. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default.datadog-agent-s5rmm", EndpointId:"eth0"}

Interestingly, Calico seems to use the conntrack command line utility to handle some of this cleanup: https://github.com/projectcalico/felix/blob/3d43b13f7cdb9e42bb98e828241dcd30a28ace74/conntrack/conntrack.go#L83. However, it appears to be based on deleting traffic routed to the IP address of the deleted DataDog pod. When I look at the session table before deleting, it looks like this:

udp      17 29 src=100.119.52.74 dst=172.21.98.62 sport=40001 dport=8125 [UNREPLIED] src=100.119.52.70 dst=100.119.52.74 sport=8125 dport=40001 mark=0 use=2

After deleting, it looks like this:

udp      17 29 src=100.119.52.74 dst=172.21.98.62 sport=40001 dport=8125 [UNREPLIED] src=172.21.98.62 dst=100.119.52.74 sport=8125 dport=40001 mark=0 use=2

Note that the IP changed from the IP of the old DataDog pod to the IP of the actual host node. Here's my revised theory:

  1. The old DataDog pod is deleted
  2. Calico deletes the session correctly routing traffic to the old pod
  3. A new packet is sent by an application pod, as expected being routed to the host IP address
  4. The new DataDog pod is not up and running with it's hostPort yet, so a new session is created by Linux that simply goes to the host IP without NAT
  5. The new DataDog pod comes up and is assigned its hostPort and NAT routes are created by Calico
  6. Traffic is continued to route to the host IP only, without NAT forwarding, because of the existing session

I think, based on this, that Calico needs to add a handling step when a hostPort is registered that clears any sessions running to the host IP (not NAT'd) on that port number. Just a theory, anyway.

@truthbk
Copy link
Member

truthbk commented Aug 20, 2018

@brantburnett this is absolutely awesome and thorough, I really appreciate you taking the time to not only reproduce and debug, but also run that little cleanup experiment for us. It is, as I suspected, more of a CNI layer (calico) issue, more than anything on the library side - UDP being stateless/connectionless (in the general accepted sense for connection) gives us very few options to detect a failure on the other end, other than perhaps inspecting all ICMP responses which would not typically be something a client would do, and even then it would still be unreliable. TCP would be a different story.

I tend to agree with your theory, especially in the light of all the information you've provided. The only doubt I have there, is why would the host add tracking to a port that is not being listened to on the host. That's a bit surprising, but might be happening.

This is probably worth reporting with Calico, though it's not an easy problem to solve because it's hard to tell if those UDP packets in the race for the new pod to come up are actually for the host or for the container. 🤔 I can see Calico arguing it's not something the CNI should handle.

Regardless, thank you very much for all the time you've put to help us debug this.

@brantburnett
Copy link

I've filed an issue with Calico: https://github.com/projectcalico/felix/issues/1880, including a pure command line reproduction using netcat.

@brantburnett
Copy link

I've also found this command line workaround using kubectl. It must be run after all DataDog pods are completely replaced, and assumes you are using Calico as your CNI. Also, it's in PowerShell (sorry Bash people), but should be convertible.

kubectl get pod -n kube-system -l k8s-app=calico-node -o name | % {kubectl exec -n kube-system $_.Split('/')[1] -c calico-node -- conntrack -D -p udp --dport 8125}

It deletes all open UDP sessions running to port 8125 on every node in the Kubernetes cluster by running conntrack on each Calico pod.

@brantburnett
Copy link

Upon further research, this appears to not be an issue with the Calico CNI plugin but instead an issue with the portmap CNI plugin. I've reported it there.

containernetworking/plugins#123

@truthbk
Copy link
Member

truthbk commented Sep 14, 2018

I'm going to close this issue as I believe it's not really an issue on our side. Thank you so much for the investigation and help @brantburnett.

Please feel free to ping us if you believe it should be re-opened.

@Diggsey
Copy link

Diggsey commented Jan 3, 2019

For people on google cloud, this command works:

kubectl get pod -n kube-system -l component=kube-proxy -o name | cut -c5- | xargs -I{} kubectl exec -n kube-system {} -- conntrack -D -p udp --dport 8125

@wyardley
Copy link

We have this issue via other clients, but I wonder, how does this deal with the case where a container crashes / respawns other than from an intentional redeploy?

I get that udp is stateless, but why not have a finite lifetime after which the socket is closed and re-opened (say every couple of minutes) so that you don't end up in a case where the client is sending metrics to nowhere indefinitely?

Someone mentioned that the Ruby client may check the agent lifetime directly, but I haven't investigated it too deeply.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants