
Event listener does not accept events after some time #687

Closed
afrittoli opened this issue Jul 24, 2020 · 17 comments

Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@afrittoli (Member) commented Jul 24, 2020

Expected Behavior

I can always deliver events to an event listener

Actual Behavior

We enabled cloud events from taskruns in our dogfooding clusters. This means that every single task run and condition that runs generates three events (start, running, passed/failed), and these events are all sent to a single event listener in the dogfooding cluster.

A few of these events are selected to trigger pipelines:

  • when start/pass/fail events are received from a taskrun that is part of a CI job, the GitHub check status is updated
  • when fail events are received from a taskrun that is part of a config-map CD service, a notification is sent to Slack

Twice in the past three weeks, since this service was enabled, the event listener stopped accepting events.
The k8s events on one of the affected taskruns show:

Events:
  Type     Reason               Age                 From     Message
  ----     ------               ----                ----     -------
  Normal   Started              28m                 TaskRun
  Normal   Pending              28m                 TaskRun  Pending
  Normal   Pending              28m                 TaskRun  pod status "Initialized":"False"; message: "containers with incomplete status: [place-scripts place-tools]"
  Normal   Pending              28m                 TaskRun  pod status "Initialized":"False"; message: "containers with incomplete status: [place-tools]"
  Normal   Pending              28m                 TaskRun  pod status "Ready":"False"; message: "containers with unready status: [step-git-source-source-sh9dr step-kubeconfig-dlvwz step-ln-dir-targetcluster-vv5r7 step-fetch-current-config step-deploy]"
  Normal   Running              28m                 TaskRun  Not all Steps in the Task have finished executing
  Normal   Succeeded            28m                 TaskRun  All Steps have completed executing
  Warning  Cloud Event Failure  28m (x13 over 28m)  TaskRun  Post http://el-tekton-events.default:8080: dial tcp 10.121.3.142:8080: connect: cannot assign requested address (10x)

The pod and service associated with the event listener do not show any obvious issue - they appear to be healthy.
The event listener itself seems to be fine.
Deleting and recreating the event listener solves the issue.

Unfortunately I do not have more data to share about this; I had to find a quick fix to restore CI. Perhaps the triggers team has an idea about what may be going wrong, or what kind of data we could collect to catch this issue the next time it happens.

My guess is that a relatively high volume of events triggers the issue, but it is not clear exactly in which component the problem occurs.

Steps to Reproduce the Problem

Additional Info

Triggers v0.5.0
Kubernetes v1.16

@dibyom (Member) commented Jul 27, 2020

dial tcp 10.121.3.142:8080: connect: cannot assign requested address (10x)

That seems to indicate that the issue is on the side of the Pipelines controller trying to emit events. We might be running out of sockets on the pipelines controller if we are publishing too many events. Are we reusing the same connection or creating a fresh TCP connection each time?

@dibyom (Member) commented Jul 27, 2020

We could verify this by exec'ing into the pod and listing the sockets (likely many are in the TIME_WAIT state).

One possible workaround is to increase MaxIdleConnections to a higher number: golang/go#16012 (comment)
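
For reference, the linked workaround boils down to tuning the sender's net/http transport. Below is a minimal sketch, assuming the events go out through a plain net/http client; the field names come from the standard library, not from the Tekton code, and the numbers are made up for illustration.

```go
package events

import (
	"net/http"
	"time"
)

// newEventHTTPClient returns a single client whose transport keeps a larger
// pool of reusable connections and closes idle ones instead of leaking them.
// The values below are illustrative, not taken from Tekton.
func newEventHTTPClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			MaxIdleConns:        100,              // total idle connections kept across all hosts
			MaxIdleConnsPerHost: 100,              // the default is 2, which forces frequent new dials
			IdleConnTimeout:     90 * time.Second, // drop idle connections instead of keeping them forever
		},
		Timeout: 30 * time.Second, // overall per-request timeout
	}
}
```

MaxIdleConnsPerHost matters most in this scenario, since all events go to the same el-tekton-events host and the standard library's default per-host idle pool is only 2 connections.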

dibyom added the kind/question (Issues or PRs that are questions around the project or a particular feature) label Jul 27, 2020
@vdemeester (Member) commented:

Some logs of a stuck event-listener.
2020-07-28-buggy-el-tekton-events.zip

@afrittoli (Member, Author) commented:

We could verify this by exec'ing into the pod and listing the sockets (likely many are in the TIME_WAIT state).

I tried exec'ing into the pod, but there's nothing usable in it.
I was hoping to cat /proc/net/tcp but not even cat is available.
I tried ephemeral containers, but my cluster is 1.16 and those are disabled by default.
I'll check if by chance they are enabled on GKE.

One possible workaround is to increase MaxIdleConnections to a higher number: golang/go#16012 (comment)

@dibyom (Member) commented Jul 28, 2020

I took a look at the logs for the EL. Nothing really stood out. Based on the Cloud Event Failure event above, I still suspect the problem is the pipelines controller trying to create too many HTTP connections in a short period. Any chance we have the logs for the controller? Actually, I can just dig through Stackdriver for logs... do we have a rough time period when this happened?

@bigkevmcd (Member) commented:

I've seen this before, when failing to close a request body (running out of sockets).

I'll take a look and see if that could be the case.

dibyom added this to the Triggers v0.8.0 milestone Aug 11, 2020
@afrittoli (Member, Author) commented Aug 12, 2020

Looking at the pod, I can see that it was restarted 67 times in almost 8 days:

el-tekton-events-6ccf68f59d-7z5fp                                 1/1     Running     67         7d22h
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 11 Aug 2020 18:49:53 +0100
      Finished:     Tue, 11 Aug 2020 18:50:00 +0100
    Ready:          True
    Restart Count:  67

The event listener was not working at times in the past few days, so I'm guessing that those restarts fixed the issue since it is working now.

@afrittoli (Member, Author) commented:

Alternatively, it could be that at busy times the pod takes too long to answer the health checks and is therefore restarted, which causes lost messages. I wonder if this could also leave connections open on the source side.

@dibyom (Member) commented Aug 13, 2020

Looks like the EL memory keeps going up until it crashes / gets restarted. This only seems to happen for el-tekton-events; the other ELs are steady in terms of memory usage. Each increasing line is one el-tekton-events- container before it gets restarted:

[screenshot: EL memory usage over time]

Link if you have view access to the Dogfooding GCP project: https://console.cloud.google.com/monitoring/dashboards/custom/bbc4dab5-e91c-491f-893a-45759414ee3a?project=tekton-releases&timeDomain=1m

@vdemeester (Member) commented:

cc @n3wscott as it "may" be related to the cloudevents SDK code?

@n3wscott commented:

You can change the sender timeout but I have an open issue to control the receiver timeout. Let me look into it tomorrow.

@dibyom (Member) commented Sep 9, 2020

I was able to somewhat repro this -> I set up a pipelines controller in my cluster to send cloud events to el-tekton-events. I used a modified version of the sink image so I could get a shell. Some findings:

  1. Each cloud event call seems to create a new TCP connection.

  2. These connections stay established forever! I ran a bunch of TaskRun calls yesterday -> all of those connections were still in the established state this morning:

root@el-tekton-events-75b55c5888-pqwmw:/go# ss -s
Total: 1714
TCP:   1799 (estab 1712, closed 86, orphaned 0, timewait 10)

Transport Total     IP        IPv6
RAW       0         0         0
UDP       0         0         0
TCP       1713      4         1709
INET      1713      4         1709
FRAG      0         0         0
  3. I could also see the memory footprint steadily increasing -> though the EL did not stall

  4. I instrumented the code and added some profiling (a generic sketch of such a hook is below), though I have yet to go through the flame graphs
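
For context, the kind of profiling hook mentioned in item 4 can be added with net/http/pprof. This is a generic sketch rather than the actual instrumentation used here; the port and wiring are assumptions.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the profiling endpoints on a separate localhost-only port so they
	// are not exposed through the EventListener's public listener.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the real sink/EventListener server here ...
	select {}
}
```

Heap and CPU profiles (and the flame graphs) can then be pulled with, for example, `go tool pprof http://localhost:6060/debug/pprof/heap`.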

Next Steps

  1. For Triggers: We should close idle connections. @khrm has Add Timeout for EventListener Server #747 open for adding some timeouts.
  2. For Pipelines: We should look into whether cloud events needs to open a new connection each time, and also close idle connections after a reasonable amount of time.

@afrittoli (Member, Author) commented:

Cool! I started looking into https://github.com/tektoncd/pipeline/blob/386637407f6715750dd643a5c740ecd9b2380b7e/pkg/reconciler/events/cloudevent/cloudeventclient.go#L38 - that's where we create a new HTTP client every time, which comes with its own transport that, I think, reuses connections by default and therefore keeps them open.
We could try to reuse the same client across reconcile cycles... or we could change the transport configuration to close connections.
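
In plain net/http terms, the "reuse the same client" option looks roughly like the sketch below. This is an illustration of the idea rather than the actual cloudevents SDK wiring; Emitter, NewEmitter, Emit, and the timeout values are all made up for the example.

```go
package events

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// Emitter holds one HTTP client that is created once and shared across
// reconcile cycles, so the transport's connection pool is actually reused.
type Emitter struct {
	client *http.Client
}

// NewEmitter is called once at controller start-up, not once per event.
func NewEmitter() *Emitter {
	return &Emitter{
		client: &http.Client{
			Transport: &http.Transport{IdleConnTimeout: 90 * time.Second},
			Timeout:   30 * time.Second,
		},
	}
}

// Emit posts one event payload to the sink. Building a fresh http.Client (and
// therefore a fresh Transport with its own pool) inside this function is the
// pattern that leaks connections.
func (e *Emitter) Emit(sinkURL string, payload []byte) error {
	resp, err := e.client.Post(sinkURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close() // closing the body lets the connection go back to the pool
	if resp.StatusCode >= 300 {
		return fmt.Errorf("unexpected status %d from %s", resp.StatusCode, sinkURL)
	}
	return nil
}
```

Either approach - reusing one client, or configuring the transport (IdleConnTimeout, or DisableKeepAlives) to close connections - avoids piling up established connections on the receiving EL.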

dibyom added the kind/bug label and removed the kind/question label Sep 9, 2020
@dibyom (Member) commented Sep 9, 2020

We could try to re-use the same client across reconcile cycles... or we could change the transport configuration to close connections.

Reusing the same client should help, I think.

Also, it seems like the high memory usage in Triggers was due to the large number of established connections. If I delete/restart the pipeline controller, the number of connections drops immediately and so does the memory usage:

[screenshot: connection count and memory usage dropping after the pipeline controller restart]

@afrittoli (Member, Author) commented:

I did some more tests with the pipeline controller, and I can confirm that the HTTP client is only created once.
However, the original HTTP client does not have the target set (it is set in https://github.com/tektoncd/pipeline/blob/386637407f6715750dd643a5c740ecd9b2380b7e/pkg/reconciler/events/event.go#L55), so I suspect the client creates a new transport each time Emit is invoked.

dibyom added a commit to dibyom/triggers that referenced this issue Sep 10, 2020

Previously, we were never closing idle connections, leading to the issues described
in tektoncd#687. This commit adds a fixed 2 minute timeout for idle connections, though
later we can also add other timeouts as well as allow users to change the
timeout values.

I verified this manually by building on a base image with a shell and then
verifying that the number of open connections eventually goes down, unlike before.

Signed-off-by: Dibyo Mukherjee <[email protected]>
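
As a rough illustration of what a fixed 2-minute idle timeout looks like on the receiving side, the sketch below sets IdleTimeout on a plain Go HTTP server; the actual EventListener change may be wired differently, and handleEvent is a hypothetical handler standing in for the sink.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// handleEvent stands in for the EventListener sink handler.
func handleEvent(w http.ResponseWriter, r *http.Request) {
	// ... decode the cloud event and run trigger processing here ...
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	srv := &http.Server{
		Addr:        ":8080",
		Handler:     http.HandlerFunc(handleEvent),
		IdleTimeout: 2 * time.Minute, // close keep-alive connections that stay idle for 2 minutes
	}
	log.Fatal(srv.ListenAndServe())
}
```

With IdleTimeout set, keep-alive connections that sit unused are closed by the server instead of accumulating, which matches the behaviour described in the verification note above.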
dibyom mentioned this issue Sep 10, 2020
dibyom added a commit to dibyom/triggers that referenced this issue Sep 10, 2020
tekton-robot pushed a commit that referenced this issue Sep 11, 2020
@dibyom (Member) commented Sep 22, 2020

tektoncd/pipeline#3201 and #755 fix the immediate issue, and we have #747 to add some tunable timeouts.

/close

@tekton-robot commented:

@dibyom: Closing this issue.

In response to this:

tektoncd/pipeline#3201 and #755 fix the immediate issue, and we have #747 to add some tunable timeouts.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
