Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Ensure readiness probe relies on translated config #10542

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

davidjumani
Copy link

Description

When the readiness listener is defined in the bootstrap config, the pod can be considered ready before it receives any config over XDS that can lead to downtime during an upgrade. This fixes the issue by defining the readiness listener during translation, ensuring that the envoy pod will be ready only when it receives the XDS config

API changes

N/A

Context

This fixes the flaky Zero Downtime tests

Testing steps

  Gloo was successfully uninstalled.
--- PASS: TestUpgradeFromLastPatchPreviousMinor (215.47s)
    --- PASS: TestUpgradeFromLastPatchPreviousMinor/Upgrade (214.39s)
        --- PASS: TestUpgradeFromLastPatchPreviousMinor/Upgrade/TestZeroDowntimeUpgrade (214.38s)
PASS
ok  	github.com/solo-io/gloo/test/kubernetes/e2e/tests	217.020s

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

@solo-changelog-bot
Copy link

Issues linked to changelog:
#10541

1 similar comment
@solo-changelog-bot
Copy link

Issues linked to changelog:
#10541

Copy link

github-actions bot commented Jan 2, 2025

Visit the preview URL for this PR (updated for commit b074b14):

https://gloo-edge--pr10542-fix-readiness-probe-omzmzbzp.web.app

(expires Thu, 09 Jan 2025 09:03:12 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: 77c2b86e287749579b7ff9cadb81e099042ef677

@davidjumani davidjumani changed the title Fix readiness probe fix: Ensure readiness probe relies on translated config Jan 2, 2025
@@ -1,4 +1,4 @@
package httplisteneroptions
package httplisteneroptions_test
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes made to avoid import cycles

@@ -91,6 +91,8 @@ func (t *translator) TranslateProxy(
reporter,
)

listeners = listener.AppendHealthCheckListener(listeners)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, previously, the readiness listener is gated by $gateway.readinessProbe helm option, now that's not possible anymore? If that's the case, should any documentation be updated?

Also, the envoy admin port /ready route is also removed from the static config, is that intentional?

Copy link
Author

@davidjumani davidjumani Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The helm option still generates the probe, however the HTTP endpoint is configured via translation and not the static bootstrap config
This is an internal change and user should not notice any difference. The ready route was removed since it was not used anywhere nor mentioned in the docs

Copy link

@andy-fong andy-fong Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be sure no customer is using the envoy admin /ready endpoint directly? If you already discussed this with Sam or Nathan, I am ok with that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Customers use the customhealthcheck with /ready if thats the question

Copy link
Author

@davidjumani davidjumani Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nfuden. Looking at the code, I don't see a way to have a route action point to an envoy cluster (I've tried clusterHeader with a request transform / header addition). Any thoughts on how to get around this or can we have the health check on a different port ?

Copy link

@sheidkamp sheidkamp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question, overall looks good.

}

// GenerateHealthCheckListener returns a readiness listener with the health check plugin defined on it
func GenerateHealthCheckListener() *v1.Listener {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a way of validating that this generates the same config as the removed Helm? Or are the passing tests enough validation?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The passing tests should do. Additionally the changes to the test files (https://github.com/solo-io/gloo/pull/10542/files#diff-1b1edb892a960bc5d498be228066a8c5eaa997b92dfb62a9c297e03c6b0fe07e) show the envoy listener generated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants