
Event listener does not accept events after some time #687

Closed
afrittoli opened this issue Jul 24, 2020 · 17 comments

Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@afrittoli (Member) commented Jul 24, 2020

Expected Behavior

I can always deliver events to an event listener

Actual Behavior

We enabled cloud events from taskruns in our dogfooding clusters. This means that every single task run and condition that runs generates three events (start, running, passed/failed), and these events are all sent to a single event listener in the dogfooding cluster.

A few of these events are selected to trigger pipelines:

  • when start/pass/fail events are received from a taskrun that is part of a CI job, the GitHub check status is updated
  • when fail events are received from a taskrun that is part of a config-map CD service, a notification is sent to Slack

Twice in the past three weeks, since this service was enabled, the event listener stopped accepting events.
The k8s events on one of the affected taskruns show:

Events:
  Type     Reason               Age                 From     Message
  ----     ------               ----                ----     -------
  Normal   Started              28m                 TaskRun
  Normal   Pending              28m                 TaskRun  Pending
  Normal   Pending              28m                 TaskRun  pod status "Initialized":"False"; message: "containers with incomplete status: [place-scripts place-tools]"
  Normal   Pending              28m                 TaskRun  pod status "Initialized":"False"; message: "containers with incomplete status: [place-tools]"
  Normal   Pending              28m                 TaskRun  pod status "Ready":"False"; message: "containers with unready status: [step-git-source-source-sh9dr step-kubeconfig-dlvwz step-ln-dir-targetcluster-vv5r7 step-fetch-current-config step-deploy]"
  Normal   Running              28m                 TaskRun  Not all Steps in the Task have finished executing
  Normal   Succeeded            28m                 TaskRun  All Steps have completed executing
  Warning  Cloud Event Failure  28m (x13 over 28m)  TaskRun  Post http://el-tekton-events.default:8080: dial tcp 10.121.3.142:8080: connect: cannot assign requested address (10x)

The pod and service associated with the event listener do not show any obvious issue - they appear to be healthy.
The event listener itself seems to be fine.
Deleting and recreating the event listener solves the issue.

Unfortunately I do not have more data to share about this; I had to find a quick fix to restore CI. Perhaps the triggers team has an idea about what may be going wrong, or what kind of data we could collect to catch this issue the next time it happens.

My guess is that a relatively high volume of events triggers the issue, but it is not clear exactly in which component the problem occurs.

Steps to Reproduce the Problem

Additional Info

Triggers v0.5.0
Kubernetes v1.16

@dibyom (Member) commented Jul 27, 2020

dial tcp 10.121.3.142:8080: connect: cannot assign requested address (10x)

That seems to indicate that the issue is on the side of the Pipelines controller trying to emit events. We might be running out of sockets on the pipelines controller if we are publishing too many events. Are we reusing the same connection or creating a fresh TCP connection each time?

@dibyom (Member) commented Jul 27, 2020

We could verify this by exec'ing into the pod and listing the sockets (likely many are in the TIME_WAIT state).

One possible workaround is to increase MaxIdleConnections to a higher number: golang/go#16012 (comment)
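
For reference, the linked workaround boils down to tuning the sender's net/http transport. Below is a minimal sketch, assuming the events go out through a plain net/http client; the field names come from the standard library, not from the Tekton code, and the numbers are made up for illustration.

```go
package events

import (
	"net/http"
	"time"
)

// newEventHTTPClient returns a single client whose transport keeps a larger
// pool of reusable connections and closes idle ones instead of leaking them.
// The values below are illustrative, not taken from Tekton.
func newEventHTTPClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			MaxIdleConns:        100,              // total idle connections kept across all hosts
			MaxIdleConnsPerHost: 100,              // the default is 2, which forces frequent new dials
			IdleConnTimeout:     90 * time.Second, // drop idle connections instead of keeping them forever
		},
		Timeout: 30 * time.Second, // overall per-request timeout
	}
}
```

MaxIdleConnsPerHost matters most in this scenario, since all events go to the same el-tekton-events host and the standard library's default per-host idle pool is only 2 connections.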

dibyom added the kind/question (Issues or PRs that are questions around the project or a particular feature) label Jul 27, 2020
@vdemeester (Member) commented:

Some logs of a stuck event-listener.
2020-07-28-buggy-el-tekton-events.zip

@afrittoli (Member, Author) commented:

We could verify this by exec'ing into the pod and listing the sockets (likely many are in the TIME_WAIT state).

I tried exec'ing into the pod, but there's nothing usable in it.
I was hoping to cat /proc/net/tcp but not even cat is available.
I tried ephemeral containers, but my cluster is 1.16 and those are disabled by default.
I'll check if by chance they are enabled on GKE.

One possible workaround is to increase MaxIdleConnections to a higher number: golang/go#16012 (comment)

@dibyom (Member) commented Jul 28, 2020

I took a look at the logs for the EL. Nothing really stood out. Based on the Cloud Event Failure event above, I still suspect the problem is the pipelines controller trying to create too many HTTP connections in a short period. Any chance we have the logs for the controller? Actually, I can just dig through Stackdriver for logs... do we have a rough time period when this happened?

@bigkevmcd (Member) commented:

I've seen this before, when failing to close a request body (running out of sockets).

I'll take a look and see if that could be the case.

dibyom added this to the Triggers v0.8.0 milestone Aug 11, 2020
@afrittoli (Member, Author) commented Aug 12, 2020

Looking at the pod, I can see that it was restarted 67 times in almost 8 days:

el-tekton-events-6ccf68f59d-7z5fp                                 1/1     Running     67         7d22h
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 11 Aug 2020 18:49:53 +0100
      Finished:     Tue, 11 Aug 2020 18:50:00 +0100
    Ready:          True
    Restart Count:  67

The event listener was not working at times in the past few days, so I'm guessing that those restarts fixed the issue since it is working now.

@afrittoli (Member, Author) commented:

Alternatively, it could be that at busy times the pod takes too long to answer the health checks and is therefore restarted, which causes lost messages. I wonder if this could also leave connections open on the source side.

@dibyom (Member) commented Aug 13, 2020

Looks like the EL memory keeps going up until it crashes / gets restarted. This only seems to happen for el-tekton-events; the other ELs are steady in terms of memory usage. Each increasing line is one el-tekton-events- container before it gets restarted:

[screenshot: EL memory usage over time]

Link if you have view access to the Dogfooding GCP project: https://console.cloud.google.com/monitoring/dashboards/custom/bbc4dab5-e91c-491f-893a-45759414ee3a?project=tekton-releases&timeDomain=1m

@vdemeester (Member) commented:

cc @n3wscott as it "may" be related to the cloudevents SDK code?

@n3wscott commented:

You can change the sender timeout but I have an open issue to control the receiver timeout. Let me look into it tomorrow.

@dibyom (Member) commented Sep 9, 2020

I was able to somewhat repro this -> I set up a pipelines controller in my cluster to send cloud events to el-tekton-events. I used a modified version of the sink image so I could get a shell. Some findings:

  1. Each cloud event call seems to create a new TCP connection.

  2. These connections stay established forever! I ran a bunch of TaskRun calls yesterday -> all of those connections were still in the established state this morning:

root@el-tekton-events-75b55c5888-pqwmw:/go# ss -s
Total: 1714
TCP:   1799 (estab 1712, closed 86, orphaned 0, timewait 10)

Transport Total     IP        IPv6
RAW       0         0         0
UDP       0         0         0
TCP       1713      4         1709
INET      1713      4         1709
FRAG      0         0         0
  3. I could also see the memory footprint steadily increasing -> though the EL did not stall

  4. I instrumented the code and added some profiling (a generic sketch of such a hook is below), though I have yet to go through the flame graphs
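
For context, the kind of profiling hook mentioned in item 4 can be added with net/http/pprof. This is a generic sketch rather than the actual instrumentation used here; the port and wiring are assumptions.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the profiling endpoints on a separate localhost-only port so they
	// are not exposed through the EventListener's public listener.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the real sink/EventListener server here ...
	select {}
}
```

Heap and CPU profiles (and the flame graphs) can then be pulled with, for example, `go tool pprof http://localhost:6060/debug/pprof/heap`.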

Next Steps

  1. For Triggers: We should close idle connections. @khrm has Add Timeout for EventListener Server #747 open for adding some timeouts.
  2. For Pipelines: We should look into whether cloud events needs to open a new connection each time, and also close idle connections after a reasonable amount of time.

@afrittoli (Member, Author) commented:

Cool! I started looking into https://github.com/tektoncd/pipeline/blob/386637407f6715750dd643a5c740ecd9b2380b7e/pkg/reconciler/events/cloudevent/cloudeventclient.go#L38 - that's where we create a new HTTP client every time, which comes with its own transport that, I think, reuses connections by default and therefore keeps them open.
We could try to reuse the same client across reconcile cycles... or we could change the transport configuration to close connections.
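
In plain net/http terms, the "reuse the same client" option looks roughly like the sketch below. This is an illustration of the idea rather than the actual cloudevents SDK wiring; Emitter, NewEmitter, Emit, and the timeout values are all made up for the example.

```go
package events

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// Emitter holds one HTTP client that is created once and shared across
// reconcile cycles, so the transport's connection pool is actually reused.
type Emitter struct {
	client *http.Client
}

// NewEmitter is called once at controller start-up, not once per event.
func NewEmitter() *Emitter {
	return &Emitter{
		client: &http.Client{
			Transport: &http.Transport{IdleConnTimeout: 90 * time.Second},
			Timeout:   30 * time.Second,
		},
	}
}

// Emit posts one event payload to the sink. Building a fresh http.Client (and
// therefore a fresh Transport with its own pool) inside this function is the
// pattern that leaks connections.
func (e *Emitter) Emit(sinkURL string, payload []byte) error {
	resp, err := e.client.Post(sinkURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close() // closing the body lets the connection go back to the pool
	if resp.StatusCode >= 300 {
		return fmt.Errorf("unexpected status %d from %s", resp.StatusCode, sinkURL)
	}
	return nil
}
```

Either approach - reusing one client, or configuring the transport (IdleConnTimeout, or DisableKeepAlives) to close connections - avoids piling up established connections on the receiving EL.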

dibyom added the kind/bug label and removed the kind/question label Sep 9, 2020
@dibyom (Member) commented Sep 9, 2020

We could try to re-use the same client across reconcile cycles... or we could change the transport configuration to close connections.

Reusing the same client should help, I think.

Also, it seems like the high memory usage in Triggers was due to the large number of established connections. If I delete/restart the pipeline controller, the number of connections drops immediately and so does the memory usage:

[screenshot: connection count and memory usage dropping after the pipeline controller restart]

@afrittoli (Member, Author) commented:

I did some more tests with the pipeline controller, and I can confirm that the HTTP client is only created once.
However, the original HTTP client does not have the target set (it is set in https://github.com/tektoncd/pipeline/blob/386637407f6715750dd643a5c740ecd9b2380b7e/pkg/reconciler/events/event.go#L55), so I suspect the client creates a new transport each time Emit is invoked.

dibyom added a commit to dibyom/triggers that referenced this issue Sep 10, 2020

Previously, we were never closing idle connections, leading to the issues described
in tektoncd#687. This commit adds a fixed 2 minute timeout for idle connections, though
later we can also add other timeouts as well as allow users to change the
timeout values.

I verified this manually by building on a base image with a shell and then
verifying that the number of open connections eventually goes down, unlike before.

Signed-off-by: Dibyo Mukherjee <[email protected]>
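
As a rough illustration of what a fixed 2-minute idle timeout looks like on the receiving side, the sketch below sets IdleTimeout on a plain Go HTTP server; the actual EventListener change may be wired differently, and handleEvent is a hypothetical handler standing in for the sink.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// handleEvent stands in for the EventListener sink handler.
func handleEvent(w http.ResponseWriter, r *http.Request) {
	// ... decode the cloud event and run trigger processing here ...
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	srv := &http.Server{
		Addr:        ":8080",
		Handler:     http.HandlerFunc(handleEvent),
		IdleTimeout: 2 * time.Minute, // close keep-alive connections that stay idle for 2 minutes
	}
	log.Fatal(srv.ListenAndServe())
}
```

With IdleTimeout set, keep-alive connections that sit unused are closed by the server instead of accumulating, which matches the behaviour described in the verification note above.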
dibyom mentioned this issue Sep 10, 2020
dibyom added a commit to dibyom/triggers that referenced this issue Sep 10, 2020
tekton-robot pushed a commit that referenced this issue Sep 11, 2020
@dibyom (Member) commented Sep 22, 2020

tektoncd/pipeline#3201 and #755 fix the immediate issue, and we have #747 to add some tunable timeouts.

/close

@tekton-robot commented:

@dibyom: Closing this issue.

In response to this:

tektoncd/pipeline#3201 and #755 fix the immediate issue, and we have #747 to add some tunable timeouts.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
