Unable to update lease lock in fabric8 client with Kubernetes version v1.25.11-eks-a5565ad #5635
Comments
@premkumarramasamy there isn't enough information here to determine what you are seeing. Presumably the exception you are showing is from a debug log with the message Exception occurred while renewing lock. This is at a debug level because it is possible for multiple electors to attempt to become leaders at the same time.
What version specifically are you referring to? There have only been a couple of small changes in the leader elector in the last 6 months, so that could help narrow it down.
I'm not 100% sure whether it's exactly the same issue, but we're experiencing the same stacktrace in FLINK-34007. I found something (see my comment) that sounds like a bug when the same LeaderElector instance is reused.
We tracked the issue down to a change in this PR #4125, which ended up in v6.0.0. We're experiencing this in Apache Flink 1.18, which uses fabric8io:kubernetes-client v6.6.2. The scenario in our case is that the client is usually not re-instantiated after a leadership loss. Therefore, it is stuck in its stopped state.
@XComp @csviri @premkumarramasamy it might be good to review the expected behavior: When the elector starts, it will repeatedly attempt to acquire the lock. If it fails to do so, there should be a debug log of "Failed to acquire lease ...", or an error log of "Exception occurred while acquiring lock ..." - ideally that should not always be presented as an error, as it's expected to fail when there is competition. I mistakenly thought above that this would be logged at the debug level. So possible issue number 1 - is @premkumarramasamy seeing a situation where the lock seems like it should be obtainable, but is not? There's not enough information here to determine that.

Once acquired, the elector will repeatedly renew. If a renewal attempt ends in an exception, we'll log that at a debug level and try again - arguably that shouldn't always be logged that low. If the elector sees that some other elector has taken over, then it will stop trying to renew and send out the appropriate notifications. It will not restart. Possible issue number 2 - what @XComp is mentioning sounds like there is an expectation that the elector will resume after leadership loss. However, I don't believe this was ever how it was implemented - cc @manusa
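For reference, here is a minimal sketch of how such an elector is typically wired up with the fabric8 client - the election name, namespace, lease name, identity, and timings below are placeholders, and the callbacks correspond to the lifecycle events described above:

```java
import java.time.Duration;
import java.util.UUID;

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElectionConfigBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock;

public class LeaderElectionSketch {

  public static void main(String[] args) {
    // the holder identity should be unique per contender (placeholder value here)
    String identity = "contender-" + UUID.randomUUID();
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      client.leaderElector()
          .withConfig(new LeaderElectionConfigBuilder()
              .withName("example-election")                                  // placeholder
              .withLock(new LeaseLock("default", "example-lease", identity)) // placeholder lease
              .withLeaseDuration(Duration.ofSeconds(15))
              .withRenewDeadline(Duration.ofSeconds(10))
              .withRetryPeriod(Duration.ofSeconds(2))
              .withLeaderCallbacks(new LeaderCallbacks(
                  () -> System.out.println("onStartLeading"),                // lock acquired
                  () -> System.out.println("onStopLeading"),                 // leadership lost or cancelled
                  newLeader -> System.out.println("new leader: " + newLeader)))
              .build())
          .build()
          // run() blocks: acquire loop, then renew loop, until leadership is lost or cancelled
          .run();
    }
  }
}
```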
thanks for your response @shawkins
Sorry if I didn't make myself clear enough here. Flink doesn't assume that the client resumes after the leadership is lost. One and the same LeaderElector instance is reused for the next leadership cycle. The problem is that the subsequent call on that instance doesn't trigger a new leader election. As far as I understand it, the elector cannot be restarted once it has stopped.
Ah, so you want to keep using the same instance. We haven't been explicit about whether that's expected to work, nor have we explicitly prevented things like simultaneous start calls. I think the intent behind the stopped change was to make it more like informers, which also cannot be restarted. Since there's state held on the elector and on the lock, we then don't need to worry about any kind of reset logic - even if it seems unnecessary or trivial in this case, we still need to ensure, for example, that the wrong listener event won't be called on another usage. So let's separate the ability to reuse elector instances into an enhancement request. At least for now you should be able to work around it by creating a new elector.
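To illustrate that workaround, here is a sketch (not an official recommendation) that builds a fresh elector, with a fresh config and lock, for every leadership cycle instead of reusing a stopped instance; the helper names are assumptions:

```java
import java.util.function.Supplier;

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElectionConfig;

public final class LeadershipLoop {

  // Keeps contending for leadership by creating a brand-new LeaderElector per cycle.
  // A Supplier is used so the config (and its lock, which also holds state) is
  // rebuilt for every attempt rather than shared between elector instances.
  public static void contendForever(KubernetesClient client,
                                    Supplier<LeaderElectionConfig> configSupplier) {
    while (!Thread.currentThread().isInterrupted()) {
      client.leaderElector()
          .withConfig(configSupplier.get())
          .build()
          .run(); // blocks until leadership is lost or the elector is cancelled
    }
  }
}
```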
Fair enough. We could interpret this as desired behavior. My initial perspective was that it's more like a bug because the old behavior supported the reuse of the instance. But the change happened as part of a major version update, which justifies this difference in semantics, I guess. Thanks for your clarifications.
One other question, though: what are the semantics of the holder identity in that case? The issue with the current approach is that one could re-instantiate the LeaderElector to handle another leadership cycle but pass in the same holder identity. If the old identity is still persisted in the ConfigMap (e.g. due to a network outage), the new LeaderElector instance cannot acquire leadership when it uses the same holder identity, because a leadership change is identified by a change in the holder identity (see LeaderElector:247). This sounds like a flaw in the current implementation, doesn't it? Shouldn't the holder identity be owned by the LeaderElector if the instance isn't meant to be reusable, to prevent scenarios where the identity is reused? 🤔 To be honest, that's the scenario we have in Flink right now; changing this would require a bigger refactoring on our side. 😇
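One way to avoid reusing the previous holder identity would be to make the identity unique per elector instance; a small sketch (the pod-name prefix is only an assumption for readability, namespace and lease name are placeholders):

```java
import java.util.UUID;

import io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock;

public final class UniqueIdentityLock {

  // Builds a LeaseLock whose holder identity is unique per LeaderElector instance,
  // so a newly created elector is never mistaken for the previous holder of the lease.
  public static LeaseLock newLock(String namespace, String leaseName) {
    String podName = System.getenv().getOrDefault("HOSTNAME", "unknown-pod");
    String identity = podName + "-" + UUID.randomUUID();
    return new LeaseLock(namespace, leaseName, identity);
  }
}
```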
@XComp I may not be following. By re-instantiate I'm assuming that you mean creating a new instance of the LeaderElector after the previous one has stopped. There are two cases:
The previous elector will leave things either in a state where the lock is still held (cancelled but not released, JVM crashed), or it's no longer holding the lock. If the lock is still held by the same identity, the new elector will still make the initial calls to updateObserved and the associated notifications, as the observedRecord starts as null. The isLeader check will succeed and it will move from the acquire loop to the renew loop. At worst, the lifecycle events such as onStopLeading and onStartLeading don't differentiate the corner cases - that is, onStopLeading is always called even if releaseOnCancel is false, etc. If the lock is held by a different identity, the new elector will enter the acquire loop and just keep trying as per usual.
The only difference here is if the lock for some reason is still held by the old elector: you'll have to wait for the lease renewal duration to expire before a new leader (including the new elector) can take over. If you think there's some other case, please open a new issue for that.
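For completeness, a sketch of how one could inspect who currently holds the underlying coordination.k8s.io/v1 Lease and for how long it is valid (namespace and name are placeholders), which is useful when waiting out the old identity's lease duration:

```java
import io.fabric8.kubernetes.api.model.coordination.v1.Lease;
import io.fabric8.kubernetes.api.model.coordination.v1.LeaseSpec;
import io.fabric8.kubernetes.client.KubernetesClient;

public final class LeaseInspector {

  // Prints the current holder of the Lease backing a LeaseLock, plus its duration
  // and last renew time; another identity can only take over once the lease
  // duration has expired after the last renewal.
  public static void printHolder(KubernetesClient client, String namespace, String name) {
    Lease lease = client.resources(Lease.class).inNamespace(namespace).withName(name).get();
    if (lease == null || lease.getSpec() == null) {
      System.out.println("no such lease: " + namespace + "/" + name);
      return;
    }
    LeaseSpec spec = lease.getSpec();
    System.out.printf("holder=%s leaseDurationSeconds=%s renewTime=%s%n",
        spec.getHolderIdentity(), spec.getLeaseDurationSeconds(), spec.getRenewTime());
  }
}
```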
I think @shawkins is right. But I still believe that the non-reusable behavior should be made explicit in the LeaderElector documentation.
Agreed - that needs to happen whether we pursue making it restartable or not.
closes: fabric8io#5635 Signed-off-by: Steve Hawkins <[email protected]>
closes: #5635 Signed-off-by: Steve Hawkins <[email protected]>
Describe the bug
We are facing continuous leadership changes due to the issue below; the same setup was working fine before the upgrade to the latest version. We have also tried deleting the lease object and redeploying the app, but the issue appears again.
There are no other/manual edits to the lease object.
Fabric8 Kubernetes Client version
6.9.2
Steps to reproduce
Pods are hitting frequent leadership changes due to this issue.
Expected behavior
The error shouldn't happen and the lease object should be updated successfully.
Runtime
Kubernetes (vanilla)
Kubernetes API Server version
other (please specify in additional context)
Environment
Amazon
Fabric8 Kubernetes Client Logs
Additional context
No response