Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slurm: different workers on the same VM don't share the same cache #3710

Closed
epicfaace opened this issue Jul 14, 2021 · 17 comments · Fixed by #3861
Closed

Slurm: different workers on the same VM don't share the same cache #3710

epicfaace opened this issue Jul 14, 2021 · 17 comments · Fixed by #3861
Assignees
Labels
goal-performance p1 Do it in the next two weeks.

Comments

@epicfaace
Copy link
Member

For the slurm worker manager

@epicfaace epicfaace added goal-performance p1 Do it in the next two weeks. labels Jul 14, 2021
@epicfaace

This comment has been minimized.

@epicfaace
Copy link
Member Author

epicfaace commented Jul 16, 2021

Another solution:

  • have one dependency manager per directory (usually just one running per VM), workers can communicate with it through API

Solution we agreed on:

  • Multiple dependency managers use a sqlite database as a lockfile, which is stored in the shared "cached dependencies" folder
    • store: reference count (number of bundles currently using this dependency) per dependency -- once it reaches 0, it's OK to delete
    • store: if a dependency is currently being downloaded (so we don't have the same dependency being downloaded multiple times) (note: we can store the last time / last used to keep track of dead workers)
    • store: if a dependency is currently being deleted (so we don't have the same dependency being deleted multiple times)
  • update dependency manager to store worker-state and runs locally, but store cached dependencies in a shared location

@epicfaace
Copy link
Member Author

About locking / sqlite:

https://www.sqlite.org/lockingv3.html

#2366

@epicfaace
Copy link
Member Author

Storage for GKE:
Objective: Share a cache among VMs, each VM has one worker at a time.

  • dependencies - stored in the PersistentVolume (common for all workers)
  • worker-state.json for each worker - stored locally on the VM
  • runs - store locally on the VM (maybe with Support model checkpoints #3716, we could store it on the PersistentVolume)

Storage for Slurm:
Objective: Share a cache among workers in a single VM, each VM has multiple workers at a time.

  • dependencies - stored on the VM (shared between workers)
  • worker-state.json for each worker - stored on the VM in a worker-specific directory
  • runs - stored on the VM in a worker-specific directory

@teetone
Copy link
Collaborator

teetone commented Oct 13, 2021

Couple of issues with FileLock:

  1. Basic concurrency test is failing on Windows, so the test is disabled: Soft filelock sometimes deadlocks on PyPy3 Windows tox-dev/filelock#101
  2. The stress test I created is failing over NFS (cluster) with the following error:
============================= test session starts ==============================
platform linux -- Python 3.6.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /sailhome/tonyhlee/codalab/py-filelock, configfile: tox.ini
collected 16 items

tests/test_filelock_codalab.py FF..............                          [100%]

=================================== FAILURES ===================================
___________ test_threaded_lock_different_lock_obj_many[SoftFileLock] ___________

self = <ExThread(g0_t4, stopped 139857540142848)>

    def run(self) -> None:
        try:
>           super().run()

tests/test_filelock_codalab.py:201: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <ExThread(g0_t4, stopped 139857540142848)>

    def run(self):
        """Method representing the thread's activity.
    
        You may override this method in a subclass. The standard run() method
        invokes the callable object passed to the object's constructor as the
        target argument, if any, with sequential and keyword arguments taken
        from the args and kwargs arguments, respectively.
    
        """
        try:
            if self._target:
>               self._target(*self._args, **self._kwargs)

/u/nlp/anaconda/main/anaconda3/envs/tonyhlee_codalab/lib/python3.6/threading.py:864: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def t() -> None:
        for _ in range(10_000):
>           with lock:

tests/test_filelock_codalab.py:33: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <filelock._soft.SoftFileLock object at 0x7f332568af98>

    def __enter__(self) -> "BaseFileLock":
        """
        Acquire the lock.
    
        :return: the lock object
        """
>       self.acquire()

src/filelock/_api.py:199: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <filelock._soft.SoftFileLock object at 0x7f332568af98>, timeout = -1.0
poll_intervall = 0.05

    def acquire(self, timeout: Optional[float] = None, poll_intervall: float = 0.05) -> AcquireReturnProxy:
        """
        Try to acquire the file lock.
    
        :param timeout: maximum wait time for acquiring the lock, ``None`` means use the default :attr:`~timeout` is and
         if ``timeout < 0``, there is no timeout and this method will block until the lock could be acquired
        :param poll_intervall: interval of trying to acquire the lock file
        :raises Timeout: if fails to acquire lock within the timeout period
        :return: a context object that will unlock the file when the context is exited
    
        .. code-block:: python
    
            # You can use this method in the context manager (recommended)
            with lock.acquire():
                pass
    
            # Or use an equivalent try-finally construct:
            lock.acquire()
            try:
                pass
            finally:
                lock.release()
    
        .. versionchanged:: 2.0.0
    
            This method returns now a *proxy* object instead of *self*,
            so that it can be used in a with statement without side effects.
    
    
        """
        # Use the default timeout, if no timeout is provided.
        if timeout is None:
            timeout = self.timeout
    
        # Increment the number right at the beginning. We can still undo it, if something fails.
        with self._thread_lock:
            self._lock_counter += 1
    
        lock_id = id(self)
        lock_filename = self._lock_file
        start_time = time.time()
        try:
            while True:
                with self._thread_lock:
                    if not self.is_locked:
                        _LOGGER.debug("Attempting to acquire lock %s on %s", lock_id, lock_filename)
>                       self._acquire()

src/filelock/_api.py:155: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <filelock._soft.SoftFileLock object at 0x7f332568af98>

    def _acquire(self) -> None:
>       raise_on_exist_ro_file(self._lock_file)

src/filelock/_soft.py:13: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

filename = '/sailhome/tonyhlee/codalab/codalab.txt.lock'

    def raise_on_exist_ro_file(filename: str) -> None:
        try:
            file_stat = os.stat(filename)  # use stat to do exists + can write to check without race condition
        except OSError:
            return None  # swallow does not exist or other errors
    
        if file_stat.st_mtime != 0:  # if os.stat returns but modification is zero that's an invalid os.stat - ignore it
            if not (file_stat.st_mode & stat.S_IWUSR):
>               raise PermissionError(f"Permission denied: {filename!r}")
E               PermissionError: Permission denied: '/sailhome/tonyhlee/codalab/codalab.txt.lock'

src/filelock/_util.py:13: PermissionError

The test fails intermittently with a few locks, but fails all the time with many locks.

@teetone
Copy link
Collaborator

teetone commented Oct 13, 2021

The stress test:

@pytest.mark.parametrize("lock_type", [SoftFileLock])
@pytest.mark.skipif(hasattr(sys, "pypy_version_info") and sys.platform == "win32", reason="deadlocks randomly")
def test_threaded_lock_different_lock_obj_many(lock_type: Type[BaseFileLock], tmp_path: Path) -> None:
    # Runs multiple threads, which acquire the same lock file with a different FileLock object.
    # When thread group i acquires the lock, all the other threads in different groups must not not hold the lock.
    number_of_locks = 100
    number_of_groups = 3
    # TODO: change to path over NFS
    lock_path = tmp_path / "a"

    def create_run(lock_id):
        lock = locks[lock_id]

        def t() -> None:
            for _ in range(10_000):
                with lock:
                    assert lock.is_locked
                    for i, lock_to_check in enumerate(locks):
                        if i == lock_id:
                            continue
                        assert not lock_to_check.is_locked
                    assert lock.is_locked
        return t

    locks = [lock_type(str(lock_path)) for _ in range(number_of_locks)]
    all_threads = [
        [ExThread(create_run(lock_id),  f"g{group}_t{lock_id}") for lock_id in range(number_of_locks)]
        for group in range(number_of_groups)
    ]

    for threads in all_threads:
        for thread in threads:
            print(thread)
            thread.start()

    for threads in all_threads:
        for thread in threads:
            thread.join()

    for lock in locks:
        assert not lock.is_locked

@epicfaace
Copy link
Member Author

stack trace

@epicfaace
Copy link
Member Author

@teetone can you print the stack trace that you got, and can you ensure that you did run it with the assert not lock_to_check.is_locked code put in?

@teetone
Copy link
Collaborator

teetone commented Oct 14, 2021

@epicfaace I updated the comment above with the full stack trace and I verified it fails with assert not. I copied the wrong file.

@epicfaace
Copy link
Member Author

epicfaace commented Oct 20, 2021

Next steps (by 10/27)

  • Compare: how long it takes to read/write from a JSON file normally, vs how long it takes to read/write from a JSON file locked when 50 different workers are trying to access it.
  • Finish implementing the PR

@epicfaace
Copy link
Member Author

Latest design to solve this issue is in this doc: https://docs.google.com/document/d/13_vV7CGngek_I6BWmD7wNNjsN5TMunR8qVIGtfJYzP0/edit

@pranavjain
Copy link
Contributor

Currently testing and debugging with Ashwin.
Tentatively, putting up a PR by Feb 11.

@epicfaace
Copy link
Member Author

epicfaace commented Feb 5, 2022 via email

@teetone
Copy link
Collaborator

teetone commented Feb 5, 2022

Tony, are there any new issues since the code fix we added on Wednesday?

-- Ashwin Ramaswami
On Fri, Feb 4, 2022 at 2:07 PM Pranav Jain @.> wrote: Currently testing and debugging with Ashwin. Tentatively, putting up a PR by Feb 11. — Reply to this email directly, view it on GitHub <#3710 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAM4MXYRGLYLOQTJYQCGDVTUZQPWNANCNFSM5ALPBORA . You are receiving this because you were mentioned.Message ID: @.>

It looks promising. I'm still testing at the moment.

@pranavjain
Copy link
Contributor

@teetone Please update the ETA

@pranavjain
Copy link
Contributor

@teetone is there any update on this one? Can you please update the ETA?

@pranavjain
Copy link
Contributor

Merged this one. Will be testing on CS324 once @epicfaace deploys it on that instance.

@mergify mergify bot closed this as completed in #3861 Apr 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
goal-performance p1 Do it in the next two weeks.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants