Event listener does not accept events after some time #687
Comments
That seems to indicate that the issue is on the side of the Pipelines controller trying to emit events. We might be running out of sockets on the pipelines controller if we are trying to publish too many events? Are we reusing the same connection or creating a fresh TCP connection each time? |
We could verify this by exec'ing into the pod and listing the number of sockets (likely many in the TIME_WAIT state). One possible workaround is to increase the number of MaxIdleConnections to a higher number: golang/go#16012 (comment) |
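For context, that knob lives on Go's `http.Transport`. Below is a minimal sketch of what tuning it might look like, assuming the sender uses the standard `net/http` client; the function name and values are illustrative, not what Tekton actually ships:

```go
package events

import (
	"net/http"
	"time"
)

// newEventSenderClient builds a client whose transport keeps more idle
// connections around for reuse instead of opening a fresh TCP connection
// per event. Field values here are illustrative only.
func newEventSenderClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        100,              // idle connections kept across all hosts
		MaxIdleConnsPerHost: 100,              // default is 2, which forces new sockets under load
		IdleConnTimeout:     90 * time.Second, // reclaim connections that go quiet
	}
	return &http.Client{Transport: transport, Timeout: 10 * time.Second}
}
```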
Some logs of a stuck event-listener. |
I tried exec'ing into the pod, but there's nothing usable in it. |
I took a look at the logs for the EL. Nothing really stood out. Based on the Cloud Event Failure event above, I still suspect the problem is with the pipelines controller trying to create too many http connections in a short time period. |
I've seen this before, when failing to close a request body (running out of sockets). I'll take a look and see if that could be the case. |
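For reference, this is the usual shape of that leak in Go: if the response body is not drained and closed, the connection cannot return to the idle pool and the process slowly runs out of sockets. A generic sketch, not Tekton's actual sending code (`sendEvent` and its arguments are made up for illustration):

```go
package events

import (
	"bytes"
	"io"
	"net/http"
)

// sendEvent posts a payload and makes sure the response body is fully drained
// and closed so the underlying TCP connection can be reused.
func sendEvent(client *http.Client, url string, payload []byte) error {
	resp, err := client.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	// Forgetting these two lines is the classic way to leak sockets: the
	// connection never goes back into the idle pool.
	defer resp.Body.Close()
	_, _ = io.Copy(io.Discard, resp.Body)
	return nil
}
```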
Looking at the pod, I can see that it was restarted 67 times in almost 8 days:
The event listener was not working at times in the past few days, so I'm guessing that those restarts fixed the issue since it is working now. |
Alternatively, it could also be that at busy times the pod takes too long to answer the health checks and thus is restarted, which causes lost messages. I wonder if this could lead to connections left open on the source too. |
Looks like the EL memory is going up until it crashes/gets restarted. This only seems to happen for Link if you have view access to the Dogfooding GCP project: https://console.cloud.google.com/monitoring/dashboards/custom/bbc4dab5-e91c-491f-893a-45759414ee3a?project=tekton-releases&timeDomain=1m |
cc @n3wscott as it "may" be related to the cloudevents sdk code? |
You can change the sender timeout but I have an open issue to control the receiver timeout. Let me look into it tomorrow. |
I was able to somewhat repro this -> I set up a pipelines controller to send cloud events to el-tekton-events in my cluster. I used a modified version of the sink image so I could get a shell. Some findings:
Next Steps
|
Cool! I started looking into https://github.com/tektoncd/pipeline/blob/386637407f6715750dd643a5c740ecd9b2380b7e/pkg/reconciler/events/cloudevent/cloudeventclient.go#L38 - that's where we create a new HTTP client every time, which comes with its own transport that I think by default reuses connections, so it keeps them open. |
Reusing the same client should help, I think? Also, it seems like the high memory usage in Triggers was due to the large number of established connections. If I delete/restart the pipeline controller, the number of connections drops immediately and so does the memory usage: |
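To make the idea concrete, this is roughly what creating the client once and sharing it could look like; it is only a sketch of the approach, not the actual `cloudeventclient.go` change (the package and function names are invented):

```go
package cloudevent

import (
	"net/http"
	"sync"
	"time"
)

var (
	once         sync.Once
	sharedClient *http.Client
)

// eventClient returns a single shared HTTP client. Building the client (and
// its Transport) once means one connection pool is reused across events,
// instead of every send creating a transport that holds its own idle
// connections open.
func eventClient() *http.Client {
	once.Do(func() {
		sharedClient = &http.Client{
			Transport: http.DefaultTransport.(*http.Transport).Clone(),
			Timeout:   10 * time.Second,
		}
	})
	return sharedClient
}
```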
I did some more tests with the pipeline controller, and I can confirm that the HTTP client is only created once. |
Previously, we were never closing idle connections, leading to the issues described in #687. This commit adds a fixed 2-minute timeout for idle connections, though later we can also add other timeouts as well as allow users to change the timeout values. I verified this manually by building on a base image with a shell and then verifying that the number of open connections eventually goes down, unlike before. Signed-off-by: Dibyo Mukherjee <[email protected]>
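For illustration, setting a fixed idle timeout on the listener's HTTP server could look like the sketch below; this is an assumption about the shape of the change described in the commit message, not the literal patch (names and structure are hypothetical):

```go
package sink

import (
	"net/http"
	"time"
)

// newSinkServer wires a two-minute idle timeout into the listener's HTTP
// server so that keep-alive connections the sender leaves open are
// eventually reclaimed.
func newSinkServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:        addr,
		Handler:     handler,
		IdleTimeout: 2 * time.Minute, // close connections that go quiet
	}
}
```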
tektoncd/pipeline#3201 and #755 fix the immediate issue, and we have #747 to add some tunable timeouts. /close |
@dibyom: Closing this issue. In response to this:
Expected Behavior
I can always deliver events to an event listener
Actual Behavior
We enabled cloud events from taskruns in our dogfooding clusters. This means that every single task run and condition that runs generates three events (start, running, passed/failed), and these events are all sent to a single event listener in the dogfooding cluster.
A few of these events are selected to trigger pipelines:
Since this service was enabled, it has happened twice in the past three weeks that the event listener stopped accepting events.
Looking at the k8s events on one of the taskruns shows:
The pod and service associated with the event listener do not expose any obvious issue; they appear to be healthy.
The event listener itself seems to be fine.
Deleting and recreating the event listeners solves the issue.
Unfortunately I do not have more data to share about this; I had to find a quick fix to restore CI. But perhaps the Triggers team has an idea about what may be going wrong, or what kind of data we could collect to catch this issue the next time it happens.
My guess is that a relatively high volume of events triggers the issue, but it's not clear in which component the issue happens exactly.
Steps to Reproduce the Problem
Additional Info
Triggers v0.5.0
Kube v0.16