[JENKINS-67403] Lockable resources acts weird when resource is reserved while locked #279

Merged (22 commits) on Feb 1, 2022


jimklimov (Contributor)

= SUMMARY

As detailed in https://issues.jenkins.io/browse/JENKINS-67403, we had a problem with a somewhat non-standard use of lockable resources: we explicitly want them to remain "not-lockable" (e.g. "locked" or "reserved") after our lock closure ends, sometimes even after the job ends. The lockable resource is then eventually un-reserved either interactively (e.g. by a developer doing a post-mortem on a SUT appliance) or programmatically (later in the same job, by a regular clean-up job, etc.)
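To make the pattern concrete, here is a minimal scripted-pipeline sketch of that workflow (this is not code from the PR; the label, variable name and reservation note are illustrative, and direct LockableResourcesManager calls from a sandboxed pipeline typically need script-security approval):

```groovy
import org.jenkins.plugins.lockableresources.LockableResourcesManager

node {
    // Lock one resource by label; its name is exposed via the given variable.
    lock(label: 'sut-appliance', variable: 'LOCKED_RES', quantity: 1) {
        // ... run tests against the appliance named by env.LOCKED_RES ...

        // Keep the appliance for a post-mortem: reserve it while it is still
        // locked, so nobody from the queue grabs it when the closure ends.
        LockableResourcesManager.get()
            .fromName(env.LOCKED_RES)
            .setReservedBy('developer investigating a failure')
    }
    // Here the resource is no longer locked, but remains reserved until it is
    // un-reserved interactively or by a clean-up job.
}
```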

We found that while this approach worked for us in most cases, sometimes it "acted weird", especially when more requests for resources were issued than could be served instantly (so they were queued). More details below.

I understand (though do not share) the reasoning which stalled PR #64 about exposing steps to acquire and release locks, as opposed to the declarative use of lockable-resources-plugin. IMHO, we should make a tool usable and versatile even if it lets people shoot themselves in the foot; this dangerous side should be documented and announced as such, but being potentially dangerous is not a blocker for merging as long as the feature is useful for users with special needs.

Finally, this PR is tangentially related to my earlier PR #144, which allows our developers to interactively "Reassign" to themselves a resource locked by a running test job (e.g. when they see a failure and want to investigate after the test), without the risk that this resource gets re-used by someone else as soon as it is unlocked (interactively or by the end of the test run).


= ORIGINAL SYMPTOMS

In practice, we found that:

  • sometimes the lockable resource was grabbed by another lock() request even though it remained reserved
  • sometimes the un-reserved resource remained available, yet the jobs waiting for it to become free did not proceed; manually locking and unlocking that resource made it actually usable for those jobs
  • as more cases were encountered, this situation was linked to resource starvation: overall, fewer resources that could fulfill the locking request were available than there were requests pending. This applies both to locking requests issued by resource name and to requests by label match (with one or a few labeled instances and many requests); a reproduction sketch follows this list
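A hypothetical way to create that queue pressure (illustrative only; the label and branch count are made up): several parallel branches compete for a single labeled resource, so most requests queue up, and any mishandling of the reserved flag during unlock() becomes visible immediately.

```groovy
// Five lock() requests against one resource labeled 'sut-appliance';
// four of them wait in the queue behind the first.
def branches = [:]
for (int i = 0; i < 5; i++) {
    def n = i  // capture the loop variable for the closure
    branches["pressure-${n}"] = {
        lock(label: 'sut-appliance', quantity: 1) {
            echo "pressure-${n} got the resource"
        }
    }
}
parallel branches
```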

Investigation led to code, and so this PR was born.

= SOLUTION

Notable points include:

  • I found that the original checkResourcesAvailability() method did not consider whether a resource isReserved(), as long as it was listed among lockedResourcesAboutToBeUnlocked, and so it led to immediate re-use of reserved resources if (and only if) there was already a queued request waiting that such a resource matched

    • I started with a quick fix to just consider isReserved(), which did help against "losing" the reservation during unlock(), but did not solve everything at once
  • the method also behaved identically for un-locking and un-reserving resources (it had no way to differentiate between them), so that quick fix did not help "un-stick" the waiting jobs when the resource was finally un-reserved as well

    • I added differentiated handling with a similar reservedResourcesAboutToBeUnreserved list
  • LockableResource methods such as unReserve(), reset() and setBuild(null) (for the "unlock()" effect) reasonably only changed fields of the resource instance and did not involve the LRM for the bigger picture

    • I added recycle() methods in both the LR and LRM classes to help announce that the resource may be re-used instantly, without much effort on the pipeline developer's side (see the sketch after this list)
  • in TDD style, the initial tests reproduced and confirmed the bugs I was hunting, both for situations where LR (LockableResource) instances were manipulated directly, and where the LRM (LockableResourcesManager) was requested to act

    • as time went on, fixes appeared and were confirmed by those added and evolving tests
  • Explicitly tested several cases:

    • a resource reserved inside a lock closure now remains reserved (and not locked by anyone else from the queue) when the closure ends
    • when such a reserved-but-not-locked resource is finally released using the old LRM.unreserve([LR]), or the new LRM.recycle([LR]) or LR.recycle(), it is immediately usable by someone from the queue
    • a resource reserved and later un-reserved inside the same lock closure is not kidnapped by anyone else from the queue
  • NOTE: I did not test, but suppose, that calling recycle() from inside the lock closure is a bad idea: if the resource becomes unlocked because of this and immediately (or soon) becomes used (locked) by someone else, then when the original closure completes and unlock() is called, the resource might be handed to a third consumer and break the work of the second one.
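As a usage sketch of the new API (the resource name is illustrative, and such calls from a script typically require script-security approval), this is how a reserved-but-no-longer-locked resource can be released so that queued lock() requests proceed:

```groovy
import org.jenkins.plugins.lockableresources.LockableResourcesManager

def lrm = LockableResourcesManager.get()
def res = lrm.fromName('sut-appliance-1')

// Before this PR, un-reserving only changed fields on the resource instance,
// so builds already queued for it could remain stuck until the resource was
// manually locked and unlocked again.
//
// With this PR, both the old unreserve() and the new recycle() also nudge
// the queue, so a waiting lock() request picks the resource up instantly:
res.recycle()          // per-resource variant on LockableResource
// lrm.recycle([res])  // equivalent list variant on LockableResourcesManager

// Per the NOTE above: avoid calling recycle() from inside a lock{} closure.
```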

= COMMITS

  • …cked as) reserved but there was a queue for it
  • …onsider available candidate resources that are reserved
  • …isregard available candidate resources that are reserved (get notified and use them when finally released by user)
  • …some of the logic and implications in human language
  • …d getNextQueuedContext() to consider locked vs reserved resources differently
  • …cycle() method to cover unlock()+unreserve() and queue the un-stucking for resources that were setReservedBy() inside a lock step
  • …to determine candidate nuances once and maintain a more compact codebase
  • … survived completely (by message it echoes in the end)
  • …etting lock#2 to not confuse the log-parser occasionally
  • …ving and unreserving a resource while in lock{} closure does not let others grab it
  • …re that reaching stage 1-5 (after we unreserve lock#1) means that lock#2 should not have been taken yet
  • …s (expected bugs indeed not encountered) into test log
  • …seless and toxic for letting go of resources queued by someone else; switch to testing the new lr.recycle() method
  • …ecycle([lr]) of a reserved but not locked resource makes it available for those who wait on the queue
  • …requesting lock() from parallel stages added just for pressure, so they do not intervene in other tests
  • …veat for recycle() - do not use from lock{} step
  • …nreserve inside lock{}, sleep a bit between these two operations