[FEATURE REQUEST] Progress tracking on long-running jobs #217

Open
MaybeJustJames opened this issue Sep 21, 2020 · 5 comments
Labels
bug Something isn't working

Comments

@MaybeJustJames

Many SCENIC jobs run for a very long time, and a user can be left wondering whether any progress is being made. A mechanism to track progress would be very useful.

@cflerin
Contributor

cflerin commented Sep 25, 2020

Hi @MaybeJustJames ,

Thanks for the suggestion. pySCENIC does actually have progress tracking built in for a number of steps, although it's maybe not always obvious.

  • GRN step: Running via Dask, you can connect to the Dask dashboard through a browser and look at the status there. If using the multiprocessing script, there's already a progress bar from tqdm.
  • The ctx and AUCell steps have a tqdm progress bar (in both CLI and interactive use, I believe)

But maybe you could tell me your specific use case and what kind of progress tracking would help you?

@MaybeJustJames
Author

This is filed on behalf of @saeedfc. Could you please respond to this @saeedfc?

@saeedfc

saeedfc commented Sep 28, 2020

Hi @cflerin

I ran into this when using your standard pipeline with Dask. I had a dataset of 13k cells from human 10x samples. Usually, that kind of data never takes more than 20-30 hours for me. However, this one ran for 5-6 days and still had not finished computing the adjacencies.
I was not sure whether it was just taking longer than usual or whether it was a Dask issue.

This was especially unclear because I was troubleshooting problems with Dask at the time and trying the solutions discussed in other threads here.
I finally had to kill the process.

Here is what I tried.

import os
import glob
import pickle
import pandas as pd
import numpy as np

from dask.diagnostics import ProgressBar

from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

from pyscenic.rnkdb import FeatherRankingDatabase as RankingDatabase
from pyscenic.utils import modules_from_adjacencies, load_motifs
from pyscenic.prune import prune2df, df2regulons
from pyscenic.aucell import aucell
from pyscenic.aucell import derive_auc_threshold
from pyscenic.binarization import binarize
import seaborn as sns

if __name__ == '__main__':
    DATA_FOLDER = '/mnt/DATA1/Fibrosis/Full Scale Analysis/SCENIC/Myeloid Cells'
    RESOURCES_FOLDER ='/mnt/DATA1/Fibrosis/Human Integration and Clustering/COVID/Epithelial Cells/SCENIC/RESOURCES_FOLDER'
    DATABASES_GLOB = os.path.join(RESOURCES_FOLDER, "hg38*.mc9nr.feather")
    MOTIF_ANNOTATIONS_FNAME = os.path.join(RESOURCES_FOLDER, "motifs-v9-nr.hgnc-m0.001-o0.0.tbl")
    MM_TFS_FNAME = os.path.join(RESOURCES_FOLDER, 'TFs.txt')
    REGULONS_FNAME = os.path.join(DATA_FOLDER, "Regulons_Myeloid.p")
    MOTIFS_FNAME = os.path.join(DATA_FOLDER, "Regulons_motifs_Myeloid.csv")
    ex_matrix = pd.read_csv("/mnt/DATA1/Fibrosis/Full Scale Analysis/SCENIC/Myeloid Cells/myeloid_expression.csv", sep = ",", header=0, index_col=0)
    ex_matrix.shape
    tf_names = load_tf_names(MM_TFS_FNAME)
    db_fnames = glob.glob(DATABASES_GLOB)
    
    def name(fname):
        return os.path.basename(fname).split(".")[0]
    dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]
    dbs
    adjacencies = grnboost2(ex_matrix, tf_names=tf_names, verbose=True, seed = 777)
    adjacencies.to_csv("/mnt/DATA1/Fibrosis/Full Scale Analysis/SCENIC/Myeloid Cells/Adjacencies_Myeloid.csv", index = False, sep = '\t')

Below is what I had on screen; it stayed like this for 5-6 days.

preparing dask client
parsing input
/home/luna.kuleuven.be/u0119129/anaconda3/lib/python3.7/site-packages/arboreto/algo.py:214: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  expression_matrix = expression_data.as_matrix()
creating dask graph
6 partitions
computing dask graph

I see some documentation about tracking here: https://docs.dask.org/en/latest/diagnostics-distributed.html. But maybe you can give a guideline on how we can track this on a local machine.
Is it enough to simply add the following at the beginning?

from dask.distributed import Client
client = Client()  # start distributed scheduler locally.  Launch dashboard

Thanks a lot for your time.
@MaybeJustJames and @cflerin

@cflerin
Contributor

cflerin commented Sep 30, 2020

Ok, thanks for describing your workflow, @saeedfc. If you want to monitor the Dask progress, the first thing I would suggest is to check out the tutorial in Arboreto describing how to connect to the Dask scheduler.
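
A minimal sketch of that approach on a local machine, assuming `dask.distributed` and `arboreto` are installed: create the Dask client yourself, print its dashboard URL, and hand the client to `grnboost2` through its `client_or_address` argument. The helper name `run_grnboost2_with_dashboard` and the `n_workers` default are illustrative choices, not part of arboreto or pySCENIC.

```python
def run_grnboost2_with_dashboard(ex_matrix, tf_names, n_workers=4, seed=777):
    """Run GRNBoost2 on a local Dask cluster and print the dashboard URL.

    Hypothetical helper: the name and the n_workers default are
    illustrative and do not come from arboreto or pySCENIC.
    """
    # Imported here so dask and arboreto are only required when the
    # function is actually called.
    from dask.distributed import Client, LocalCluster
    from arboreto.algo import grnboost2

    cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1)
    client = Client(cluster)
    try:
        # Open this URL in a browser to watch tasks being scheduled and completed.
        print(f"Dask dashboard: {client.dashboard_link}")
        return grnboost2(
            expression_data=ex_matrix,
            tf_names=tf_names,
            client_or_address=client,  # reuse our client instead of the default 'local'
            verbose=True,
            seed=seed,
        )
    finally:
        # Shut down the workers even if the computation fails or is interrupted.
        client.close()
        cluster.close()
```

In the script above, this would replace the direct `grnboost2(...)` call inside the `if __name__ == '__main__':` block. The dashboard is normally served on port 8787 and needs `bokeh` installed; if the task stream there stays empty for a long time, that points to a scheduler or worker problem rather than a slow computation.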

Otherwise, I see what you mean about progress reporting to the command prompt, but I don't know if we can really change that. One thing I would suggest if you're having problems with the GRN step is to try the multiprocessing script, which is more stable and gives a more informative progress report.
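
A per-task progress report of this kind is possible whenever the work splits into independent tasks whose completions can be counted, as in GRN inference, where one regression model is fitted per target gene. The following is a generic sketch of that pattern and not the code of the multiprocessing script itself: `run_with_progress` is a hypothetical helper, and threads stand in for worker processes only to keep the example short.

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_progress(func, items, max_workers=4, stream=sys.stderr):
    """Apply func to every item in parallel and report each completion.

    Returns the results in the same order as the input items.
    """
    items = list(items)
    total = len(items)
    results = [None] * total
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the position of its input item.
        futures = {pool.submit(func, item): i for i, item in enumerate(items)}
        # as_completed yields futures as they finish, in completion order.
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            elapsed = time.monotonic() - start
            print(f"{done}/{total} tasks finished ({elapsed:.1f}s elapsed)", file=stream)
    return results
```

Because a line is printed after every finished task, a stalled run shows up as a counter that stops advancing, which is the information missing from the "computing dask graph" output above.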

@saeedfc

saeedfc commented Oct 1, 2020

Thank you @cflerin . I shall try the multiprocessing script and maybe the dask scheduler as well.

Thanks and Kind Regards,
Saeed
