Shutting down non-leader pod starts leader jobs #1738

pleshakov · 2024-03-20T19:35:07Z

Describe the bug
When a non-leader pod is shut down, leader jobs such as telemetry reporting and status updating are started during the shutdown, for reasons that are not yet clear.

To Reproduce

1. Deploy NGF with multiple replicas.
2. Watch the logs of a non-leader pod (kubectl logs -f).
3. Shut down the pod with kubectl delete pod.
4. See log entries like the ones below:

{"level":"info","ts":"2024-03-20T19:28:58Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":"2024-03-20T19:28:58Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"secret","controllerGroup":"","controllerKind":"Secret"}
. . .
{"level":"info","ts":"2024-03-20T19:28:58Z","msg":"Stopping and waiting for leader election runnables"}

(the two lines below appear to come from the status updater, which should not have been kicked off)

{"level":"info","ts":"2024-03-20T19:28:58Z","logger":"statusUpdater","msg":"Writing last statuses"}
{"level":"info","ts":"2024-03-20T19:28:58Z","logger":"statusUpdater","msg":"Updating Gateway API statuses"}
. . .

(the two lines below correspond to telemetry reporting, which should not have been started)

{"level":"info","ts":"2024-03-20T19:28:58Z","logger":"telemetryJob","msg":"Starting cronjob"}
{"level":"error","ts":"2024-03-20T19:28:58Z","logger":"telemetryJob","msg":"Failed to collect telemetry data"," ...
. . .
{"level":"info","ts":"2024-03-20T19:28:58Z","logger":"telemetryJob","msg":"Stopping cronjob"}
. . .
{"level":"info","ts":"2024-03-20T19:28:58Z","msg":"Wait completed, proceeding to shutdown the manager"}
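
One way to check captured logs for this symptom is to scan them for entries from the leader-only loggers. The Go sketch below assumes the JSON log format shown above and treats statusUpdater and telemetryJob as the leader-only loggers, since those are the ones seen in this report. The names logEntry and leaderOnlyEntries are hypothetical helpers, not part of NGF.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds the fields of interest from one JSON log line.
type logEntry struct {
	Level  string `json:"level"`
	TS     string `json:"ts"`
	Logger string `json:"logger"`
	Msg    string `json:"msg"`
}

// leaderOnlyLoggers lists the loggers from this report that belong to
// leader jobs. This set is an assumption based on the logs above.
var leaderOnlyLoggers = map[string]bool{
	"statusUpdater": true,
	"telemetryJob":  true,
}

// leaderOnlyEntries returns the log entries emitted by leader-only loggers.
// Lines that are not valid JSON (for example ". . .") are skipped.
func leaderOnlyEntries(logs string) []logEntry {
	var found []logEntry
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal([]byte(sc.Text()), &e); err != nil {
			continue
		}
		if leaderOnlyLoggers[e.Logger] {
			found = append(found, e)
		}
	}
	return found
}

func main() {
	logs := `{"level":"info","ts":"2024-03-20T19:28:58Z","msg":"Stopping and waiting for leader election runnables"}
. . .
{"level":"info","ts":"2024-03-20T19:28:58Z","logger":"statusUpdater","msg":"Writing last statuses"}
{"level":"info","ts":"2024-03-20T19:28:58Z","logger":"telemetryJob","msg":"Starting cronjob"}`
	for _, e := range leaderOnlyEntries(logs) {
		fmt.Printf("%s %s: %s\n", e.TS, e.Logger, e.Msg)
	}
}
```

Any output for a pod that never held the lease would indicate that leader jobs ran on a non-leader.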

Expected behavior

Leader jobs should not start during the shutdown of a non-leader pod.
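
The expected behavior can be illustrated with a toy model. This is only a sketch, loosely based on the split between "leader election runnables" and "non leader election runnables" visible in the logs; it is not how controller-runtime or NGF implement it, and the types job and manager and the job names are hypothetical.

```go
package main

import "fmt"

// job is a hypothetical stand-in for a runnable registered with a manager.
type job struct {
	name               string
	needLeaderElection bool
	started            bool
}

// manager is a toy model, not controller-runtime's implementation.
type manager struct {
	isLeader bool
	jobs     []*job
}

// newManager builds a manager with one regular job and two leader jobs.
func newManager(isLeader bool) *manager {
	return &manager{
		isLeader: isLeader,
		jobs: []*job{
			{name: "secret-controller"},
			{name: "status-updater", needLeaderElection: true},
			{name: "telemetry-job", needLeaderElection: true},
		},
	}
}

// start launches regular jobs always and leader jobs only on the leader.
func (m *manager) start() {
	for _, j := range m.jobs {
		if !j.needLeaderElection || m.isLeader {
			j.started = true
		}
	}
}

// shutdown stops the jobs that are running and returns their names.
// It never starts a job that was not already running.
func (m *manager) shutdown() []string {
	var stopped []string
	for _, j := range m.jobs {
		if j.started {
			j.started = false
			stopped = append(stopped, j.name)
		}
	}
	return stopped
}

func main() {
	m := newManager(false) // a non-leader pod
	m.start()
	fmt.Println(m.shutdown())
}
```

In this model, shutting down a non-leader touches only the regular job, so the leader jobs never log anything.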

Your environment
NGF - edge, 5b13734


salonichf5 · 2024-03-20T19:58:22Z

logs-10-node-gradual-down-scaling.csv
logs-25-node-gradual-down-scaling.csv

Log files for gradual scale down of NGF

pleshakov · 2024-03-21T13:32:29Z

A related issue in controller-runtime: kubernetes-sigs/controller-runtime#2719

kate-osborn · 2024-04-11T17:06:29Z

This can be tested once #1818 is merged

sjberman · 2024-04-15T14:57:41Z

Confirmed that this bug is now fixed via the upstream changes.

pleshakov added the bug (Something isn't working) label Mar 20, 2024

pleshakov added this to the v1.2.0 milestone Mar 20, 2024

pleshakov self-assigned this Mar 20, 2024

pleshakov removed this from the v1.2.0 milestone Mar 20, 2024

pleshakov mentioned this issue Mar 20, 2024

Add test results for zero downtime scaling #1733

Merged

bjee19 mentioned this issue Mar 21, 2024

Release 1.2.0 #1742

Merged

pleshakov removed their assignment Mar 21, 2024

mpstefan added the tracking (To track external issues or changes that will affect NKG) label Mar 25, 2024

mpstefan added this to the v1.3.0 milestone Mar 25, 2024

sjberman self-assigned this Apr 15, 2024

sjberman closed this as completed Apr 15, 2024
