-
Notifications
You must be signed in to change notification settings - Fork 10.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: make LocalCache
not use synchronized
to detect recursive loads (#6845)
#6851
refactor: make LocalCache
not use synchronized
to detect recursive loads (#6845)
#6851
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
(Thanks. We do need a signed CLA before we can look at this. The link above has some instructions.) |
Yeah, I'm working on it. I need a corporate CLA and am still figuring out from whom to get one at my company ^^. |
…on (#6845) - replaces the previous synchronized implementation to make it work gracefully with VirtualThreads - remember the thread that created a LoadingValueReference to later determine whether the "loader" is the same thread as the waiting thread, and we are therefore inside a deadlock
@cpovirk sorry it took a while, but the CLA has been signed now, so we can continue with everything else now I think. |
@@ -2205,7 +2200,14 @@ V waitForLoadingValue(ReferenceEntry<K, V> e, K key, ValueReference<K, V> valueR | |||
throw new AssertionError(); | |||
} | |||
|
|||
checkState(!Thread.holdsLock(e), "Recursive load of: %s", key); | |||
if (e.getValueReference() instanceof LoadingValueReference) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting this together!
Would it make sense to check the valueReference
instead of e.getValueReference()
? I ask because I discovered during testing that the current code has a race: It checks e.getValueReference() instanceof LoadingValueReference
, and then it casts e.getValueReference()
to LoadingValueReference
, but the first e.getValueReference()
call might return a loading reference and the second a completed reference.
One way to fix that is to call e.getValueReference()
only once (e.g., checkRecursiveLoad(key, e.getValueReference());
). But it would seem easier and less fragile to use valueReference
if that's correct.
The worst thing that I can think of about using valueReference
is that we might check the loading thread unnecessarily (in the case that the reference completed between our earlier isLoading()
check and this one). But that doesn't seem like a significant worry.
It actually looks like we could change waitForLoadingValue
to require a LoadingValueReference
instead of a plain ValueReference
: The code already checks this above, so we could just perform a cast before the call instead of a conditional cast here.
I have pretty well talked myself into trying this :) I'll report back the results. But please do speak up if I'm missing anything!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No your right I think (and the race could have definitely become a problem, good catch!). Just getting the ValueReference
directly and not from the entry also seems much more straightforward in general, I think I must have overlooked that it's already present as a method parameter.
I am a bit worried about changing the method signature to LoadingValueReference
though. Although I don't see it creating any immediate problems in the code, it feels weird that we are inferring the "is loading" information both from the type and by calling the method. Thus far I think the code relies on calling the method, not the type (this test p.e sets the loading property without being of type LoadingValueReference
, and when overlooking the code I didn't see any instanceof
/ typecasts for LoadingValueReference
but quite a lot of calls to isLoading()
). I'm not sure if it's a good idea to introduce two ways of checking for this property.
Of course we at least need to check the type before retrieving the loading thread, but we could do it so that if the value isLoading()
but is not of type LoadingValueReference
we simply omit the recursion check (since we don't have a choice) but otherwise continue as usual, without throwing a ClassCastException
.
I don't generally like that we rely on isLoading()
instead of the type, but since we seem to be doing it everywhere else I think we should not make an exception here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, good catch! I had convinced myself that LoadingValueReference
was the only type for which isLoading()
was true, but I had neglected to look at the tests. I agree that the current setup is a little weird and also that it's not worth fighting against. I'll update my version and retest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I suppose that another option is to pull getLoadingThread()
up into ValueReference
and have it return null
in the non-loading case. But it's probably safest to keep a clear delineation between the prod LoadingValueReference
, which has the invariants that isLoading()
is always true
and getLoadingThread()
is always non-null, and whatever it is that the test implementation does, which at present does not have such invariants :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, back to your approach, just using valueReference
instead, in https://github.com/google/guava/pull/6857/files#diff-98cf5b8d5ed981e94f4c745137a9b4ac3d0e896d0471cfb4e3baf85928c5053d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All right, let me know how when you find out anything new 👍.
Hi! Multiple teams in our company recently pulled this commit and notice an obvious increase of Error message we have seen
I think what happened is |
can you provide a unit test to reproduce the problem? I think that would help expedite a fix. |
Thanks for quick reply. Here is my unit test and Im able to reproduce it. Create cache loader with async refresh in another thread pool. Let async refresh to be a long process so that when key is expired (expiredAfterWrite), the async reload is still ongoing. And I notice the
Result:
This issue should happen when async fresh takes long enough that item got expired thus hit that logic ( |
Thanks! Then I think this change should be reverted. When a refresh is scheduled by a cache read by the caller it creates a new This recursive check predated the addition of refreshAfterWrite so I hadn't considered that scenario. By luck the @cpovirk please review |
We could also just not set the loader on creation, but rather directly before calling But I understand If this is too high risk, since we already broke something... |
Yes, I'll roll this back once I get in to work this morning. Sorry for the trouble, and thank you for the report. (I don't love that I'm basically arguing to treat this code as "haunted" instead of actually working to find the best solution, but I think that's our safest bet at this point.) |
Thanks folks for quick action here and really appreciated for seeking for safest and best solution! |
…12730). The `LocalCache` change led to [false reports of recursion during refresh](#6851 (comment)). This CL keeps the new tests from the `LocalCache` change, which already would have passed before the CL. Ideally we would additionally include [tests that demonstrate the refresh problem](#6851 (comment)), but we'll need to see if the contributor can sign a CLA. This rollback un-fixes #6845. Users of `common.cache` and virtual threads will likely have to wait until `synchronized` no longer pins threads. (And at that point, if not earlier, they should migrate to [Caffeine](https://github.com/ben-manes/caffeine) :)) RELNOTES=`cache`: Fixed a bug that could cause [false "recursive load" reports during refresh](#6851 (comment)). PiperOrigin-RevId: 605049501
I got a bit sidetracked, but #6972 is now out for internal review. @paggynie, if you're interested in contributing your test, you could send a PR to add it to (Hopefully we'll just remember that this code is haunted and try not to touch it again, test or no test... :) But our team has many people on it, and even those of us directly involved might still forget.... Still, no pressure.) |
…12730). The `LocalCache` change led to [false reports of recursion during refresh](#6851 (comment)). This CL keeps the new tests from the `LocalCache` change, which already would have passed before the CL. Ideally we would additionally include [tests that demonstrate the refresh problem](#6851 (comment)), but we'll need to see if the contributor can sign a CLA. This rollback un-fixes #6845. Users of `common.cache` and virtual threads will likely have to wait until `synchronized` no longer pins threads. (And at that point, if not earlier, they should migrate to [Caffeine](https://github.com/ben-manes/caffeine) :)) RELNOTES=`cache`: Fixed a bug that could cause [false "recursive load" reports during refresh](#6851 (comment)). PiperOrigin-RevId: 605049501
…12730). The `LocalCache` change led to [false reports of recursion during refresh](#6851 (comment)). This CL keeps the new tests from the `LocalCache` change, which already would have passed before the CL. Ideally we would additionally include [tests that demonstrate the refresh problem](#6851 (comment)), but we'll need to see if the contributor can sign a CLA. This rollback un-fixes #6845. Users of `common.cache` and virtual threads will likely have to wait until `synchronized` no longer pins threads. (And at that point, if not earlier, they should migrate to [Caffeine](https://github.com/ben-manes/caffeine) :)) RELNOTES=`cache`: Fixed a bug that could cause [false "recursive load" reports during refresh](#6851 (comment)). PiperOrigin-RevId: 605069776
Thanks for rolling it back. |
@cortlepp-intershop You might want to try the new EA builds that no longer pin on synchronization. That might help you in general and they are asking for feedback. See their announcement for details. |
@ben-manes Thanks so much for the heads-up! I already got the email and am currently evaluating the best way to cram their tarball into a container image. But however that will go, it makes me hopeful that there might be an official and released solution for monitors soon(ish). |
@paggynie This is released in Guava 33.1.0 (I just saw @cpovirk's announcement in [email protected]) |
…1.0-jre` ### What changes were proposed in this pull request? The pr aims to upgrade Guava used by the `connect` module to `33.1.0-jre`. ### Why are the changes needed? - The new version bring some bug fixes and optimizations as follows: cache: Fixed a bug that could cause google/guava#6851 (comment). hash: Optimized Checksum-based hash functions for Java 9+. - The full release notes: https://github.com/google/guava/releases/tag/v33.1.0 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45540 from panbingkun/SPARK-47426. Authored-by: panbingkun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
Hi @ben-manes @cpovirk I have started a thread in guava-discuss seeking some clarification |
…1.0-jre` ### What changes were proposed in this pull request? The pr aims to upgrade Guava used by the `connect` module to `33.1.0-jre`. ### Why are the changes needed? - The new version bring some bug fixes and optimizations as follows: cache: Fixed a bug that could cause google/guava#6851 (comment). hash: Optimized Checksum-based hash functions for Java 9+. - The full release notes: https://github.com/google/guava/releases/tag/v33.1.0 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#45540 from panbingkun/SPARK-47426. Authored-by: panbingkun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? Bump guava from 32.1.3-jre to 33.1.0-jre. ### Why are the changes needed? Guava v33.1.0 has been released, which release note refers to [v33.1.0](https://github.com/google/guava/releases/tag/v33.1.0). v33.1.0 brings some bug fixes and optimizations as follows: * cache: Fixed a bug that could cause google/guava#6851 (comment) for `CacheLoader`/`CacheBuilder`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #2439 from SteNicholas/CELEBORN-1366. Authored-by: SteNicholas <[email protected]> Signed-off-by: mingji <[email protected]>
### What changes were proposed in this pull request? Bump guava from 32.1.3-jre to 33.1.0-jre. ### Why are the changes needed? Guava v33.1.0 has been released, which release note refers to [v33.1.0](https://github.com/google/guava/releases/tag/v33.1.0). v33.1.0 brings some bug fixes and optimizations as follows: * cache: Fixed a bug that could cause google/guava#6851 (comment) for `CacheLoader`/`CacheBuilder`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #2439 from SteNicholas/CELEBORN-1366. Authored-by: SteNicholas <[email protected]> Signed-off-by: mingji <[email protected]>
This PR refactors
LocalCache
to no longer usesynchronized
in order to detect recursive loads and fail fast instead of deadlocking. The main motivation for this change is to makeLocalCache
work gracefully with JDK21VirtualThreads
when using a blocking operation inside a loader.The new implementation saves the current
Thread
when creating aLoadingValueReference
and checks it before waiting for the entry. Since theLoadingValueReference
is swapped out eventually for the real entry, the refernce to theThread
exists only temporarily.I also added some tests for the recursive load scenario, both for the direct load and if it occurs with a proxy loader in between.
Closes #6845.