[JENKINS-67403] Lockable resources acts weird when resource is reserved while locked #279

Merged (22 commits) on Feb 1, 2022


jimklimov (Contributor)

= SUMMARY

As detailed in https://issues.jenkins.io/browse/JENKINS-67403, we had a problem with a somewhat non-standard use of lockable resources: we explicitly want them to remain "not-lockable" (e.g. "locked" or "reserved") after our lock closure ends, sometimes even after the job ends. The lockable resource is then eventually un-reserved either interactively (e.g. by a developer doing a post-mortem on a SUT appliance) or programmatically (later in the same job, by a regular clean-up job, etc.)
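To make the pattern concrete, here is a minimal scripted-pipeline sketch of that workflow (this is not code from the PR; the label, variable name and reservation note are illustrative, and direct LockableResourcesManager calls from a sandboxed pipeline typically need script-security approval):

```groovy
import org.jenkins.plugins.lockableresources.LockableResourcesManager

node {
    // Lock one resource by label; its name is exposed via the given variable.
    lock(label: 'sut-appliance', variable: 'LOCKED_RES', quantity: 1) {
        // ... run tests against the appliance named by env.LOCKED_RES ...

        // Keep the appliance for a post-mortem: reserve it while it is still
        // locked, so nobody from the queue grabs it when the closure ends.
        LockableResourcesManager.get()
            .fromName(env.LOCKED_RES)
            .setReservedBy('developer investigating a failure')
    }
    // Here the resource is no longer locked, but remains reserved until it is
    // un-reserved interactively or by a clean-up job.
}
```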

We found that while this approach worked for us in most cases, sometimes it "acted weird", especially when more requests for resources were issued than could be served instantly (so they were queued). More details below.

I understand (though do not share) the reasoning which stalled PR #64 about exposing steps to acquire and release locks, as opposed to the declarative use of lockable-resources-plugin. IMHO, we should make a tool usable and versatile even if it lets people shoot themselves in the foot; this dangerous side should be documented and announced as such, but being potentially dangerous is not a blocker for merging as long as the feature is useful for users with special needs.

Finally, this PR is tangentially related to my earlier PR #144, which allows our developers to interactively "Reassign" to themselves a resource locked by a running test job (e.g. when they see a failure and want to investigate after the test), without the risk that this resource gets re-used by someone else as soon as it is unlocked (interactively or by the end of the test run).


= ORIGINAL SYMPTOMS

In practice, we found that:

  • sometimes the lockable resource was grabbed by another lock() request even though it remained reserved
  • sometimes the un-reserved resource remained available, yet the jobs waiting for it to become free did not proceed; manually locking and unlocking that resource made it actually usable for those jobs
  • as more cases were encountered, this situation was linked to resource starvation: overall, fewer resources that could fulfill the locking request were available than there were requests pending. This applies both to locking requests issued by resource name and to requests by label match (with one or a few labeled instances and many requests); a reproduction sketch follows this list
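A hypothetical way to create that queue pressure (illustrative only; the label and branch count are made up): several parallel branches compete for a single labeled resource, so most requests queue up, and any mishandling of the reserved flag during unlock() becomes visible immediately.

```groovy
// Five lock() requests against one resource labeled 'sut-appliance';
// four of them wait in the queue behind the first.
def branches = [:]
for (int i = 0; i < 5; i++) {
    def n = i  // capture the loop variable for the closure
    branches["pressure-${n}"] = {
        lock(label: 'sut-appliance', quantity: 1) {
            echo "pressure-${n} got the resource"
        }
    }
}
parallel branches
```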

Investigation led to code, and so this PR was born.

= SOLUTION

Notable points include:

  • I found that the original checkResourcesAvailability() method did not consider whether a resource isReserved(), as long as it was listed among lockedResourcesAboutToBeUnlocked, and so it led to immediate re-use of reserved resources if (and only if) there was already a queued request waiting that such a resource matched

    • I started with a quick fix to just consider isReserved(), which did help against "losing" the reservation during unlock(), but did not solve everything at once
  • the method also behaved identically for un-locking and un-reserving resources (it had no way to differentiate between them), so that quick fix did not help "un-stick" the waiting jobs when the resource was finally un-reserved as well

    • I added differentiated handling with a similar reservedResourcesAboutToBeUnreserved list
  • LockableResource methods such as unReserve(), reset() and setBuild(null) (for the "unlock()" effect) reasonably only changed fields of the resource instance and did not involve the LRM for the bigger picture

    • I added recycle() methods in both the LR and LRM classes to help announce that the resource may be re-used instantly, without much effort on the pipeline developer's side (see the sketch after this list)
  • in TDD style, the initial tests reproduced and confirmed the bugs I was hunting, both for situations where LR (LockableResource) instances were manipulated directly, and where the LRM (LockableResourcesManager) was requested to act

    • as time went on, fixes appeared and were confirmed by those added and evolving tests
  • Explicitly tested several cases:

    • a resource reserved inside a lock closure now remains reserved (and not locked by anyone else from the queue) when the closure ends
    • when such a reserved-but-not-locked resource is finally released using the old LRM.unreserve([LR]), or the new LRM.recycle([LR]) or LR.recycle(), it is immediately usable by someone from the queue
    • a resource reserved and later un-reserved inside the same lock closure is not kidnapped by anyone else from the queue
  • NOTE: I did not test, but suppose, that calling recycle() from inside the lock closure is a bad idea: if the resource becomes unlocked because of this and immediately (or soon) becomes used (locked) by someone else, then when the original closure completes and unlock() is called, the resource might be handed to a third consumer and break the work of the second one.
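As a usage sketch of the new API (the resource name is illustrative, and such calls from a script typically require script-security approval), this is how a reserved-but-no-longer-locked resource can be released so that queued lock() requests proceed:

```groovy
import org.jenkins.plugins.lockableresources.LockableResourcesManager

def lrm = LockableResourcesManager.get()
def res = lrm.fromName('sut-appliance-1')

// Before this PR, un-reserving only changed fields on the resource instance,
// so builds already queued for it could remain stuck until the resource was
// manually locked and unlocked again.
//
// With this PR, both the old unreserve() and the new recycle() also nudge
// the queue, so a waiting lock() request picks the resource up instantly:
res.recycle()          // per-resource variant on LockableResource
// lrm.recycle([res])  // equivalent list variant on LockableResourcesManager

// Per the NOTE above: avoid calling recycle() from inside a lock{} closure.
```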

= COMMITS

  • …cked as) reserved but there was a queue for it
  • …onsider available candidate resources that are reserved
  • …isregard available candidate resources that are reserved (get notified and use them when finally released by user)
  • …some of the logic and implications in human language
  • …d getNextQueuedContext() to consider locked vs reserved resources differently
  • …cycle() method to cover unlock()+unreserve() and queue the un-stucking for resources that were setReservedBy() inside a lock step
  • …to determine candidate nuances once and maintain a more compact codebase
  • … survived completely (by message it echoes in the end)
  • …etting lock#2 to not confuse the log-parser occasionally
  • …ving and unreserving a resource while in lock{} closure does not let others grab it
  • …re that reaching stage 1-5 (after we unreserve lock#1) means that lock#2 should not have been taken yet
  • …s (expected bugs indeed not encountered) into test log
  • …seless and toxic for letting go of resources queued by someone else; switch to testing the new lr.recycle() method
  • …ecycle([lr]) of a reserved but not locked resource makes it available for those who wait on the queue
  • …requesting lock() from parallel stages added just for pressure, so they do not intervene in other tests
  • …veat for recycle() - do not use from lock{} step
  • …nreserve inside lock{}, sleep a bit between these two operations