-
Notifications
You must be signed in to change notification settings - Fork 648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition when resuming an aborted run #5600
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Ben Sherman <[email protected]>
✅ Deploy Preview for nextflow-docs-staging canceled.
|
Alternative solution would be for the task to always delete |
Not sure to understand on what resource the race condition happens? |
Maybe it's not so much a race condition as it is a cache corruption Nextflow re-executes a task in the same directory as before, and because the previous |
Basically we are saying the task fails |
The task completes with exit code |
ok, got it |
resumeDir = entry ? FileHelper.asPath(entry.trace.getWorkDir()) : null | ||
if( resumeDir ) | ||
exists = resumeDir.exists() | ||
workDir = entry | ||
? FileHelper.asPath(entry.trace.getWorkDir()) | ||
: task.getWorkDirFor(hash) | ||
if( workDir ) | ||
exists = workDir.exists() | ||
|
||
log.trace "[${safeTaskName(task)}] Cacheable folder=${resumeDir?.toUriString()} -- exists=$exists; try=$tries; shouldTryCache=$shouldTryCache; entry=$entry" | ||
final cached = shouldTryCache && exists && entry.trace.isCompleted() && checkCachedOutput(task.clone(), resumeDir, hash, entry) | ||
log.trace "[${safeTaskName(task)}] Cacheable folder=${workDir?.toUriString()} -- exists=$exists; try=$tries; shouldTryCache=$shouldTryCache; entry=$entry" | ||
final cached = shouldTryCache && exists && entry && entry.trace.isCompleted() && checkCachedOutput(task.clone(), workDir, hash, entry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of that piece of code the lesser the changes the better, both to make it simple to review and to keep history readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know. This is about as simple as it gets while keeping the intent clear.
- check for a cache entry and it's work dir, or compute the work dir if there is no cache entry
- check whether the work dir exists
- check for cached outputs if the cache entry and work dir exist and the task completed
and further down:
4. if the outputs are cached then use them
5. otherwise, if the work dir exists then use a new work dir
6. otherwise, create the work dir and use it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, getting better, but why adding entry &&
in the cached
condition? If the intent is to applied the same logic when the entry
is missing should not the condition remain the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the previous logic, it wouldn't check for cached outputs if the cache entry was missing. That is also how it behaves here. The && entry
is required to prevent a null reference exception on entry.trace.isCompleted()
Signed-off-by: Ben Sherman <[email protected]>
final lock = lockManager.acquire(hash) | ||
final workDir = task.getWorkDirFor(hash) | ||
try { | ||
if( resumeDir != workDir ) | ||
exists = workDir.exists() | ||
if( !exists && !workDir.mkdirs() ) | ||
if( !workDir.mkdirs() ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been wondering about this lock over the task directory. I think the purpose is to prevent two tasks from using the same directory. But in that case maybe it should be over the previous try-catch block?
It should prevent two tasks from checking the same directory at the same time, because that is how a task determines whether to use the directory or try a different one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may. however what I'm thinking that to solve this issue it should be assumed a new task run should always use a newly created directory. Not sure this logic satisfy it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently this lock is useless, because if two tasks request the same directory, the lock will serialize the mkdirs()
call but won't actually cause the second task to move to a different directory
Close #5595
When a Nextflow run is aborted for any reason, Nextflow will try to kill all jobs and save them to the cache as failed tasks, so that on a resumed run, the new jobs will use new task directories.
However, if Nextflow is unable to complete this cleanup for any reason, such as an abrupt crash, the task cache might not be updated. On a resumed run, the new task will be re-executed in the same directory as before:
nextflow/modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy
Lines 806 to 826 in b23e42c
What I think happens here is:
This can cause a race condition because if the begin and exit files are still present, Nextflow could submit a job, detect the begin/exit codes from the previous job, mark the job as completed (as long as the required outputs are still present), and launch downstream tasks on truncated data.
I think this can only happen on grid executors because the GridTaskHandler checks the begin/exit files before polling the scheduler. The cloud executors poll their API first, and the local executor doesn't check these files at all.
It should be possible to replicate this issue on a grid executor by cancelling a run with
-disable-jobs-cancellation
and allowing the jobs to complete before resuming the run. Which by the way is an unsafe property of that CLI option.I'm trying to modify the local executor to simulate the issue but haven't been able to replicate yet