Shutting down non-leader pod starts leader jobs #1738

Closed
pleshakov opened this issue Mar 20, 2024 · 4 comments
Comments

@pleshakov
Contributor

Describe the bug
When a non-leader pod is shut down, leader jobs such as telemetry reporting and status updating are, for some reason, started during the shutdown.

To Reproduce

  • Deploy NGF with multiple replicas
  • Watch the logs of a non-leader pod (kubectl logs -f).
  • Shut down the pod with kubectl delete pod.
  • See errors like the following in the logs:
{"level":"info","ts":"2024-03-20T19:28:58Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":"2024-03-20T19:28:58Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"secret","controllerGroup":"","controllerKind":"Secret"}
. . .
{"level":"info","ts":"2024-03-20T19:28:58Z","msg":"Stopping and waiting for leader election runnables"}

(The two lines below should correspond to the status updater, which should not have been kicked off.)

{"level":"info","ts":"2024-03-20T19:28:58Z","logger":"statusUpdater","msg":"Writing last statuses"}
{"level":"info","ts":"2024-03-20T19:28:58Z","logger":"statusUpdater","msg":"Updating Gateway API statuses"}
. . .

(The lines below correspond to telemetry reporting, which should not have been started.)

{"level":"info","ts":"2024-03-20T19:28:58Z","logger":"telemetryJob","msg":"Starting cronjob"}
{"level":"error","ts":"2024-03-20T19:28:58Z","logger":"telemetryJob","msg":"Failed to collect telemetry data"," ...
. . .
{"level":"info","ts":"2024-03-20T19:28:58Z","logger":"telemetryJob","msg":"Stopping cronjob"}
. . .
{"level":"info","ts":"2024-03-20T19:28:58Z","msg":"Wait completed, proceeding to shutdown the manager"}
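
To check a captured log for this symptom without reading it line by line, a small scanner along these lines might help. It is only a sketch: the logger names `statusUpdater` and `telemetryJob` are taken from the excerpt above and assumed to identify leader jobs, while `logEntry`, `leaderLoggers`, and `findLeaderJobLines` are hypothetical names introduced for illustration.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds the fields of the JSON log lines that this sketch inspects.
type logEntry struct {
	Logger string `json:"logger"`
	Msg    string `json:"msg"`
}

// leaderLoggers lists logger names assumed to belong to leader-only jobs
// (copied from the log excerpt in this issue).
var leaderLoggers = map[string]bool{
	"statusUpdater": true,
	"telemetryJob":  true,
}

// findLeaderJobLines returns "logger: msg" for every entry written by a
// leader-only job. Lines that are not valid JSON (blank lines, ". . .",
// truncated entries) are skipped.
func findLeaderJobLines(logs string) []string {
	var hits []string
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal([]byte(sc.Text()), &e); err != nil {
			continue
		}
		if leaderLoggers[e.Logger] {
			hits = append(hits, e.Logger+": "+e.Msg)
		}
	}
	return hits
}

func main() {
	// A shortened copy of the excerpt above.
	logs := `{"level":"info","ts":"2024-03-20T19:28:58Z","msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":"2024-03-20T19:28:58Z","logger":"statusUpdater","msg":"Writing last statuses"}
. . .
{"level":"info","ts":"2024-03-20T19:28:58Z","logger":"telemetryJob","msg":"Starting cronjob"}`
	for _, h := range findLeaderJobLines(logs) {
		fmt.Println(h)
	}
}
```

On a non-leader pod, any line reported by this scanner during shutdown would indicate the behavior described in this issue.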

Expected behavior

Leader jobs should not start during the shutdown of a non-leader pod.
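
The expected behavior can be illustrated with a deliberately simplified model. This is a sketch only: `runnableGroup` and `simulate` are hypothetical names, jobs run synchronously for brevity, and the code does not reproduce controller-runtime's implementation or the upstream fix. It only shows the invariant that stopping a group which never started must not run its jobs.

```go
package main

import (
	"fmt"
	"sync"
)

// runnableGroup is a hypothetical stand-in for a set of jobs that are
// started and stopped together. It is not controller-runtime's type.
type runnableGroup struct {
	mu      sync.Mutex
	started bool
	jobs    []func()
}

// Start runs every job in the group once.
func (g *runnableGroup) Start() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.started {
		return
	}
	g.started = true
	for _, job := range g.jobs {
		job()
	}
}

// StopAndWait shuts the group down. A group that was never started is
// left alone, so its jobs never run as a side effect of stopping.
func (g *runnableGroup) StopAndWait() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if !g.started {
		return
	}
	g.started = false
}

// simulate returns the names of the leader jobs that ran during one pod
// lifetime, given whether the pod was ever elected leader.
func simulate(elected bool) []string {
	var ran []string
	leader := &runnableGroup{jobs: []func(){
		func() { ran = append(ran, "statusUpdater") },
		func() { ran = append(ran, "telemetryJob") },
	}}
	if elected {
		leader.Start()
	}
	leader.StopAndWait() // shutdown path, taken by leaders and non-leaders alike
	return ran
}

func main() {
	fmt.Println(simulate(false)) // non-leader: no leader jobs ran
	fmt.Println(simulate(true))  // leader: both jobs ran
}
```

In this model a non-leader pod reaches shutdown with an empty list of executed leader jobs, which is the outcome this issue asks for.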

Your environment
NGF - edge, 5b13734

@pleshakov pleshakov added the bug Something isn't working label Mar 20, 2024
@pleshakov pleshakov added this to the v1.2.0 milestone Mar 20, 2024
@pleshakov pleshakov self-assigned this Mar 20, 2024
@pleshakov pleshakov removed this from the v1.2.0 milestone Mar 20, 2024
@salonichf5
Contributor

logs-10-node-gradual-down-scaling.csv
logs-25-node-gradual-down-scaling.csv

Log files for gradual scale down of NGF

@pleshakov
Contributor Author

A related issue in controller-runtime: kubernetes-sigs/controller-runtime#2719

@pleshakov pleshakov removed their assignment Mar 21, 2024
@mpstefan mpstefan added the tracking To track external issues or changes that will affect NKG label Mar 25, 2024
@mpstefan mpstefan added this to the v1.3.0 milestone Mar 25, 2024
@kate-osborn
Contributor

This can be tested once #1818 is merged

@sjberman sjberman self-assigned this Apr 15, 2024
@sjberman
Collaborator

Confirmed that this bug is now fixed via the upstream changes.
