[JENKINS-67403] Lockable resources acts weird when resource is reserved while locked #279
Merged
= SUMMARY
As detailed in https://issues.jenkins.io/browse/JENKINS-67403, we had a problem with a somewhat non-standard use of lockable resources, namely one where we explicitly want them to remain "not-lockable" (i.e. "locked" or "reserved") after our lock closure ends, sometimes even after the job ends. In this scheme, the lockable resource is eventually un-reserved either interactively (e.g. by a developer doing a post-mortem on a SUT appliance) or programmatically (later in the same job, by a regular clean-up job, etc.).
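For context, here is a minimal scripted-pipeline sketch of that usage pattern. The resource name, the failure-handling detail and the direct manager calls (which would need script approval or a trusted shared library) are assumptions for illustration, not code from this PR:

```groovy
import org.jenkins.plugins.lockableresources.LockableResourcesManager

node {
    lock(resource: 'SUT-APPLIANCE-1') {
        try {
            // ... deploy and test against the appliance ...
        } catch (err) {
            // Keep the appliance "not-lockable" after the lock closure ends,
            // so a developer can do a post-mortem before anyone re-uses it.
            def lr = LockableResourcesManager.get().fromName('SUT-APPLIANCE-1')
            lr.setReservedBy("post-mortem: ${env.BUILD_URL}")
            throw err
        }
    }
    // At this point the resource is unlocked but may still be reserved; it is
    // released later, interactively or by a clean-up job.
}
```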
We found that while this approach worked for us in most cases, sometimes it "acted weird", especially when more resource requests arrived than could be served instantly (so they were queued). More details below.
I understand (though do not share) the reasoning which stalled PR #64 about exposing steps to acquire and release the locks, as opposed to the purely declarative use of the lockable-resources-plugin. IMHO, we should make the tool usable and versatile, even if it can let people shoot themselves in the foot; this dangerous side should be documented and announced as such, but being potentially dangerous is not a blocker for a merge as long as the feature is useful for special-needs use(r)s.
Finally, this PR is tangentially related to my earlier PR #144, which allows our developers to interactively "Reassign" to themselves a resource locked by a running test job (e.g. when they see a failure and want to investigate after the test), without the risk that this resource gets re-used by someone else as soon as it is unlocked (interactively or by the test run ending).
= ORIGINAL SYMPTOMS
In practice, we found that a resource which had been reserved while locked could be handed over to another queued lock() request as soon as its lock closure ended, even though it remained reserved. Investigation led into the code, and so this PR was born.
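To make the symptom concrete, here is a rough scripted-pipeline sketch of the pre-fix behaviour; the resource name, timings and branch names are assumptions for illustration:

```groovy
import org.jenkins.plugins.lockableresources.LockableResourcesManager

parallel(
    'holder': {
        lock(resource: 'SUT-1') {
            // Reserve the resource while it is locked, intending to keep it
            // "not-lockable" after the closure ends.
            def lr = LockableResourcesManager.get().fromName('SUT-1')
            lr.setReservedBy('keep for post-mortem')
            sleep 5
        }
        // Pre-fix bug: as soon as this closure ended, the queued 'competitor'
        // request below was granted SUT-1 even though it was still reserved.
    },
    'competitor': {
        sleep 1   // ensure this lock() request ends up queued behind 'holder'
        lock(resource: 'SUT-1') {
            echo 'got SUT-1 (should only happen after it is un-reserved)'
        }
    }
)
```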
= SOLUTION
Notable points include:
- I found that the original checkResourcesAvailability() method did not consider whether a resource isReserved(), as long as it was listed among lockedResourcesAboutToBeUnlocked, and so it led to immediate re-use of reserved resources IFF there was already a queued request waiting that such a resource matched.
- The method also behaved identically for un-locking and un-reserving resources (it had no way to differentiate), so a quick fix to just consider isReserved() did not help "un-stick" the waiting jobs when the resource was finally un-reserved as well.
- LockableResource methods such as unReserve(), reset() and setBuild(null) (for the "unlock()" effect) reasonably only changed fields of the resource instance and did not involve the LRM for the bigger picture (see the sketch after this list).
- In TDD style, the initial tests reproduced and confirmed the bugs I was hunting, both for situations where the LR (LockableResource) instances were manipulated directly and where the LRM (LockableResourcesManager) was requested to act.
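As a rough illustration of that LR-vs-LRM distinction, here is a script-console style sketch (the resource name is hypothetical, and recycle() refers to the method introduced by this PR):

```groovy
import org.jenkins.plugins.lockableresources.LockableResourcesManager

def lrm = LockableResourcesManager.get()
def lr  = lrm.fromName('SUT-1')   // hypothetical resource name

// Resource-level call: only flips the "reserved" fields of this one instance
// and, by itself, does not make the manager revisit its queue of waiting
// lock() requests:
//   lr.unReserve()

// Manager-level call: releases the resource(s) and lets the LRM re-evaluate
// its queue, so a waiting lock() step can proceed; recycle() covers the
// combined unlock()+unreserve() effect.
lrm.recycle([lr])
```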
Tested explicitly several cases, including that after a reserved (but not locked) resource is released via LRM.unreserve([LR]), the new LRM.recycle([LR]) or LR.recycle(), it is immediately usable by someone from the queue.

NOTE: I did not test, but suppose, that calling recycle() from inside the lock closure is a bad idea: if the resource becomes unlocked due to this and immediately/soon becomes used (locked) by someone else, then when the original closure completes and unlock() is called, the resource might be handed to a third consumer and break the work of the second one.
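A sketch of the safer pattern implied by this caveat is to call recycle() outside any lock{} closure, e.g. from a periodic clean-up job; the selection of resources by a reservation-note prefix below is purely an assumption for illustration:

```groovy
import org.jenkins.plugins.lockableresources.LockableResourcesManager

def lrm = LockableResourcesManager.get()

// Pick reserved-but-not-locked resources whose reservation note marks them
// as post-mortem holds (the prefix is an illustrative convention).
def stale = lrm.getResources().findAll { r ->
    r.isReserved() && !r.isLocked() &&
        (r.getReservedBy() ?: '').startsWith('post-mortem')
}

if (stale) {
    // Releases the resources and wakes up any queued lock() requests waiting
    // on them; done here, outside any lock{} step, per the caveat above.
    lrm.recycle(stale)
}
```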