Slurm: different workers on the same VM don't share the same cache #3710
Comments
Another solution:
Solution we agreed on:
|
About locking / sqlite: |
Storage for GKE:
Storage for Slurm:
|
Couple of issues with FileLock:
The test fails intermittently with a few locks, but fails all the time with many locks.
The stress test:

@pytest.mark.parametrize("lock_type", [SoftFileLock])
@pytest.mark.skipif(hasattr(sys, "pypy_version_info") and sys.platform == "win32", reason="deadlocks randomly")
def test_threaded_lock_different_lock_obj_many(lock_type: Type[BaseFileLock], tmp_path: Path) -> None:
    # Runs multiple threads, which acquire the same lock file with a different FileLock object.
    # When thread group i acquires the lock, all the other threads in different groups must not hold the lock.
    number_of_locks = 100
    number_of_groups = 3
    # TODO: change to path over NFS
    lock_path = tmp_path / "a"

    def create_run(lock_id):
        lock = locks[lock_id]

        def t() -> None:
            for _ in range(10_000):
                with lock:
                    assert lock.is_locked
                    for i, lock_to_check in enumerate(locks):
                        if i == lock_id:
                            continue
                        assert not lock_to_check.is_locked
                    assert lock.is_locked

        return t

    locks = [lock_type(str(lock_path)) for _ in range(number_of_locks)]
    all_threads = [
        [ExThread(create_run(lock_id), f"g{group}_t{lock_id}") for lock_id in range(number_of_locks)]
        for group in range(number_of_groups)
    ]

    for threads in all_threads:
        for thread in threads:
            print(thread)
            thread.start()

    for threads in all_threads:
        for thread in threads:
            thread.join()

    for lock in locks:
        assert not lock.is_locked
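For anyone trying to reproduce this without the filelock test harness, here is a minimal standard-library sketch of the same idea: several threads, each with its own lock object, contend for one lock file. It is not filelock's implementation. `ExclusiveCreateLock` and `run_stress` are hypothetical names made up for this example, and the lock relies only on `os.open` with `O_CREAT | O_EXCL`, the exclusive-create primitive that soft file locks are generally built on. It runs against a local temporary directory; pointing `lock_path` at an NFS mount is the interesting case, since exclusive-create behaviour there depends on the NFS version and client.

```python
import os
import tempfile
import threading
import time
from pathlib import Path


class ExclusiveCreateLock:
    """Hypothetical helper: the lock is held by whoever creates the lock file."""

    def __init__(self, path, poll_interval=0.001):
        self.path = str(path)
        self.poll_interval = poll_interval

    def __enter__(self):
        while True:
            try:
                # O_EXCL makes creation fail if the file already exists.
                fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            except FileExistsError:
                time.sleep(self.poll_interval)  # someone else holds it; retry
            else:
                os.close(fd)
                return self

    def __exit__(self, *exc_info):
        # Releasing the lock means deleting the file we created.
        os.remove(self.path)


def run_stress(lock_path, n_threads=8, iterations=50):
    # holders: threads currently inside the critical section
    # max_holders: highest value of holders ever seen (should stay at 1)
    state = {"holders": 0, "max_holders": 0, "entries": 0}

    def worker():
        # Each thread builds its own lock object for the shared path.
        lock = ExclusiveCreateLock(lock_path)
        for _ in range(iterations):
            with lock:
                state["holders"] += 1
                state["max_holders"] = max(state["max_holders"], state["holders"])
                state["entries"] += 1
                state["holders"] -= 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state


with tempfile.TemporaryDirectory() as tmp:
    result = run_stress(Path(tmp) / "a.lock")
```

If `result["max_holders"]` ever exceeds 1, two lock objects were inside the critical section at once, which is the failure mode the stress test above is probing for.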
stack trace |
@teetone can you print the stack trace that you got, and can you ensure that you did run it with the |
@epicfaace I updated the comment above with the full stack trace and I verified it fails with |
Next steps (by 10/27)
|
Latest design to solve this issue is in this doc: https://docs.google.com/document/d/13_vV7CGngek_I6BWmD7wNNjsN5TMunR8qVIGtfJYzP0/edit
Currently testing and debugging with Ashwin.
Tony, are there any new issues since the code fix we added on Wednesday?
…--
Ashwin Ramaswami
On Fri, Feb 4, 2022 at 2:07 PM Pranav Jain ***@***.***> wrote:
Currently testing and debugging with Ashwin.
Tentatively, putting up a PR by Feb 11.
|
It looks promising. I'm still testing at the moment.
@teetone Please update the ETA
@teetone is there any update on this one? Can you please update the ETA?
Merged this one. Will be testing on CS324 once @epicfaace deploys it on that instance.
For the slurm worker manager