diff --git a/docs/hello_nextflow/01_hello_world.md b/docs/hello_nextflow/01_hello_world.md index 1bb8786a..9bca9060 100644 --- a/docs/hello_nextflow/01_hello_world.md +++ b/docs/hello_nextflow/01_hello_world.md @@ -328,12 +328,6 @@ Nextflow is in control of this directory and we are not supposed to interact wit Let's look at how to use the `publishDir` directive for managing this more conveniently. -!!! note - - A newer syntax option had been proposed to make it possible to declare and publish workflow-level outputs, documented [here](https://www.nextflow.io/docs/latest/workflow.html#publishing-outputs). - This will eventually make using `publishDir` at the process level redundant for completed pipelines. - However, we expect that `publishDir` will still remain very useful during pipeline development. - ### 4.1. Add a `publishDir` directive to the process To make the output file more accessible, we can utilize the `publishDir` directive. @@ -382,11 +376,15 @@ Our `output.txt` file is in this directory. If you check the contents it should match the output in our work/task directory. This is how we move results files outside of the working directories. +It is also possible to set the `publishDir` directive to make a symbolic link to the file instead of actually copying it. +This is useful when you're dealing with very large files. +However, if you delete the work directory as part of a cleanup operation, you will lose access to the file, so always make sure you have actual copies of everything you care about before deleting anything. + +!!! note - It is also possible to set the `publishDir` directive to make a symbolic link to the file instead of actually copying it. - This is useful when you're dealing with very large files. - However, if you delete the work directory as part of a cleanup operation, you will lost access to the file, so always make sure you have actual copies of everything you care about before deleting anything.
+ A newer syntax option has been proposed to make it possible to declare and publish workflow-level outputs, documented [here](https://www.nextflow.io/docs/latest/workflow.html#publishing-outputs). + This will eventually make using `publishDir` at the process level redundant for completed pipelines. + However, we expect that `publishDir` will still remain very useful during pipeline development. ### Takeaway @@ -396,4 +394,10 @@ More generally, you know how to interpret a simple Nextflow workflow, manage its +You've also learned how to use the essential components of Nextflow and have a basic grasp of the logic of how to build a workflow and retrieve the desired outputs. + ### What's next? -[TODO] LEARN HOW TO USE CHANNELS TO PROVIDE INPUTS TO A PROCESS +Take a break! + +[TODO] + +When you're ready, move on to Part X to learn about [TODO]. diff --git a/docs/hello_nextflow/02_hello_channels.md b/docs/hello_nextflow/02_hello_channels.md index 2ff6713f..28de60da 100644 --- a/docs/hello_nextflow/02_hello_channels.md +++ b/docs/hello_nextflow/02_hello_channels.md @@ -281,6 +281,10 @@ process sayHello { } ``` +!!! tip + + You MUST use double quotes around the output filename expression (NOT single quotes), otherwise the variable inside it will not be interpolated and Nextflow will not find the output file it expects. + This should produce a unique output file name for every call of each process. ### 2.5. Run the workflow and look at the results directory @@ -330,155 +334,357 @@ Learn how to make the workflow take a file as its source of input values. --- -## 3. Modify the workflow to take a file as its source of input values -It's often the case that, when we want to run on a batch of multiple input elements, the input values are contained in a file. -As an example, we have provided you with a CSV file called `greetings.csv` in the `data/` directory, containing several greetings separated by commas.
+We want to be able to specify the input from the command line, since that is the piece that will almost always be different in subsequent runs of the workflow. +Good news: Nextflow has a built-in workflow parameter system called `params`, which makes it easy to declare and use CLI parameters. -```csv title="greetings.csv" -Hello,Bonjour,Holà +### 3.1. Edit the input channel declaration to use a parameter + +Here we replace the hardcoded input strings with `params.greeting` in the channel creation line. + +_Before:_ + +```groovy title="hello-world.nf" linenums="23" +// create a channel for inputs +greeting_ch = Channel.of('Hello','Bonjour','Holà') ``` -So we just need to modify our workflow to read in the values from a file like that. +_After:_ -### 3.1. Update the channel declaration to use the input file +```groovy title="hello-world.nf" linenums="23" +// create a channel for inputs +greeting_ch = Channel.of(params.greeting) +``` -Since we now want to use a file instead of a simple value as the input, we can't use the `of()` channel factory from before. -We need to switch to using a new channel factory, `fromPath()`, which has some built-in functionality for handling file paths. +This automatically creates a parameter called `greeting` that you can use to provide a value in the command line. + +### 3.2. Run the workflow again with the `--greeting` parameter + +To provide a value for this parameter, simply add `--greeting <value>` to your command line. Let's start with using a single value. + +```bash +nextflow run hello-world.nf --greeting 'Bonjour le monde!' +``` + +Running this should feel extremely familiar by now. + +```console title="Output" + N E X T F L O W ~ version 24.10.0 + + ┃ Launching `hello-world.nf` [cheesy_engelbart] DSL2 - revision: b58b6ab94b + +executor > local (1) +[1c/9b6dc9] sayHello (1) [100%] 1 of 1 ✔ +``` + +Be sure to open up the output file to check that you now have the new version of the greeting. Voilà! + +!!!
tip + + It's helpful to distinguish Nextflow-level parameters from pipeline-level parameters. + For parameters that apply to a pipeline, we use a double hyphen (`--`), whereas we use a single hyphen (`-`) for parameters that modify a specific Nextflow setting, _e.g._ the `-resume` feature we used earlier. + +### 3.3. Set a default value for a command line parameter + +In many cases, it makes sense to supply a default value for a given parameter so that you don't have to specify it for every run. + +Let's initialize the `greeting` parameter with a default value by adding the parameter declaration before the workflow definition (with a comment block as a free bonus). + +```groovy title="hello-world.nf" linenums="3" +/* + * Pipeline parameters + */ +params.greeting = 'Holà mundo!' +``` + +!!! tip + + You can put the parameter declaration inside the workflow block if you prefer. Whatever you choose, try to group similar things in the same place so you don't end up with declarations all over the place. + +### 3.4. Run the workflow again without specifying the parameter + +Now that you have a default value set, you can run the workflow again without having to specify a value in the command line. + +```bash +nextflow run hello-world.nf +``` + +The console output should look the same. + +```console title="Output" + N E X T F L O W ~ version 24.10.0 + + ┃ Launching `hello-world.nf` [wise_waddington] DSL2 - revision: 988fc779cf + +executor > local (1) +[c0/8b8332] sayHello (1) [100%] 1 of 1 ✔ +``` + +Check the output in the results directory, and... Tadaa! It works! +Nextflow used the default value to name the output. + +!!! note + + If you provide the parameter on the command line, the CLI value will override the default value. Feel free to test this out. + + ```bash + nextflow run hello-world.nf --greeting 'Konnichiwa!' + ``` + + In Nextflow, there are multiple places where you can specify values for parameters. 
+ If the same parameter is set to different values in multiple places, Nextflow will determine what value to use based on the order of precedence that is described [here](https://www.nextflow.io/docs/latest/config.html). + +### Takeaway + +You know how to use CLI parameters to feed inputs to the workflow. + +### What's next? + +Learn how to make the workflow take a file as its source of input values. + +--- + +## 4. Supply a batch of multiple values via the `params` system + +We sneakily reverted to running on just one value there. +What if we want to run on a batch again, like we did earlier? + +Common sense suggests we should be able to simply pass in an array of values instead of a single value. Right? + +### 4.1. Switch the `params.greeting` value to an array of values + +[TODO] _Before:_ ```groovy title="hello-world.nf" linenums="23" +/* + * Pipeline parameters + */ +params.greeting = 'Holà mundo!' ``` _After:_ ```groovy title="hello-world.nf" linenums="23" +/* + * Pipeline parameters + */ +params.greeting = ['Holà mundo','Konnichiwa','Dobrý den'] ``` [TODO] +### 4.2. Run the workflow [TODO] OH NO IT DOES NOT WORK. SHOW ERROR. JUST RUNS ONCE, TRIES TO USE THE WHOLE ARRAY AS A SINGLE INPUT. NEED TO TRANSFORM HOW CONTENTS ARE ORGANIZED/PACKAGED IN THE CHANNEL ### 4.3. Use the `flatten()` operator [TODO] INTRODUCE CONCEPT OF OPERATORS. "You can think of them as ways of transforming the contents of a channel in a variety of ways."
LOOK AT OPERATOR DOCS, FIND SPLITCSV: "an 'operator' to transform that CSV file into channel contents". +[TODO] INTRODUCE CONCEPT OF OPERATORS. "You can think of them as ways of transforming the contents of a channel in a variety of ways." -To apply the operator, add it to the channel construction instruction: +[TODO] LOOK AT DOCS, FIND FLATTEN. ADD TO CHANNEL CONSTRUCTION LIKE THIS _Before:_ ```groovy title="hello-world.nf" linenums="46" -// create a channel for inputs from a CSV file -greeting_ch = Channel.fromPath('data/greetings.csv') +// create a channel for inputs +greeting_ch = Channel.of(params.greeting) ``` _After:_ ```groovy title="hello-world.nf" linenums="46" -// create a channel for inputs from a CSV file -greeting_ch = Channel.fromPath('data/greetings.csv') - .splitCsv() +// create a channel for inputs +greeting_ch = Channel.of(params.greeting) + .flatten() ``` -### 3.2. Run it [TODO] +### 4.4. Add `view()` to inspect channel contents [TODO] -[TODO] BREAKS AGAIN. SHOW ERROR. COULD DIG AROUND WORK DIRECTORY BUT SCREW IT WE NEED TO GET SOME CLARITY ON WHAT THE CHANNEL CONTENTS LOOK LIKE. +We can inspect how each operator changes how the contents of a channel are organized using the `.view()` operator: -### 3.3. Use `view()` to inspect channel contents [TODO] +_Before:_ -We can inspect how each operator changes how the contents of a channel are organized using the `.view()` operator: +```groovy title="hello-world.nf" linenums="46" +// create a channel for inputs +greeting_ch = Channel.of(params.greeting) + .flatten() +``` + +_After:_ + +```groovy title="hello-world.nf" linenums="46" +// create a channel for inputs +greeting_ch = Channel.of(params.greeting) + .view{ "Before flatten: $it" } + .flatten() + .view{ "After flatten: $it" } +``` + +### 4.5. 
Run the workflow [TODO] + +[TODO] YAY IT WORKS AND ALSO THE VIEW STATEMENTS SHOW US WHAT'S HAPPENING + +As you can see from the view() statements, the `flatten()` operator has transformed the channel from containing arrays to containing individual elements. This can be useful when you want to process each item separately in your workflow. + +!!! tip + + You can delete or comment out the `view()` statements before moving on. + + ```groovy title="hello-world.nf" linenums="46" + // create a channel for inputs + greeting_ch = Channel.of(params.greeting) + .flatten() + ``` + +### Takeaway + +You know how to use the flatten() operator to handle a batch of values passed in through the CLI parameter system, and use the view() operator to inspect channel contents before and after applying the operators. + +### What's next? + +Learn how to make the workflow take a file as its source of input values. + +--- + +## 5. Modify the workflow to take a file as its source of input values + +It's often the case that, when we want to run on a batch of multiple input elements, the input values are contained in a file. +As an example, we have provided you with a CSV file called `greetings.csv` in the `data/` directory, containing several greetings separated by commas. + +```csv title="greetings.csv" +Hello,Bonjour,Holà +``` + +So we need to modify our workflow to read in the values from a file like that. + +### 5.1. Switch the `params.greeting` to the CSV file + +_Before:_ + +```groovy title="hello-world.nf" linenums="23" +/* + * Pipeline parameters + */ +params.greeting = ['Holà mundo','Konnichiwa','Dobrý den'] +``` + +_After:_ + +```groovy title="hello-world.nf" linenums="23" +/* + * Pipeline parameters + */ +params.greeting = 'data/greetings.csv' +``` + +### 5.2. Update the channel declaration to use the input file + +Since we now want to use a file instead of a simple value as the input, we can't use the `of()` channel factory from before.
+We need to switch to using a new channel factory, `fromPath()`, which has some built-in functionality for handling file paths. _Before:_ ```groovy title="hello-world.nf" linenums="46" -// create a channel for inputs -greeting_ch = Channel.of(params.greeting) - .flatten() ``` _After:_ ```groovy title="hello-world.nf" linenums="46" // create a channel for inputs from a CSV file -greeting_ch = Channel.fromPath('data/greetings.csv') - .splitCsv() +greeting_ch = Channel.fromPath(params.greeting) ``` ### 5.3. Run it [TODO] [TODO] OH NO IT BREAKS. SHOW ERROR. JUST RUNS ONCE, TRIES TO USE THE PATH ITSELF AS GREETING. NOT WHAT WE WANT. WE WANT TO READ IN THE CONTENTS OF THE FILE. SOUNDS LIKE WE NEED ANOTHER OPERATOR! ### 5.4. Add `splitCsv()` operator [TODO] [TODO] LOOK AT OPERATOR DOCS, FIND SPLITCSV: "an 'operator' to transform that CSV file into channel contents". To apply the operator, add it to the channel construction instruction as before; we're also going to include view statements while we're at it.
_Before:_ ```groovy title="hello-world.nf" linenums="46" // create a channel for inputs from a CSV file -greeting_ch = Channel.fromPath('data/greetings.csv') - .splitCsv() +greeting_ch = Channel.fromPath(params.greeting) ``` _After:_ ```groovy title="hello-world.nf" linenums="46" // create a channel for inputs from a CSV file -greeting_ch = Channel.fromPath('data/greetings.csv') +greeting_ch = Channel.fromPath(params.greeting) + .view{ "Before splitCsv: $it" } .splitCsv() .view{ "After splitCsv: $it" } ``` ### 5.5. Run it [TODO] [TODO] OH COME ON IT BREAKS AGAIN. SHOW ERROR. CAN PROBABLY ALREADY GUESS WHAT THE PROBLEM IS BUT HEY LET'S CHECK THOSE VIEW STATEMENTS. OH SEE, BRACKETS. CONFIRMS IT PASSED ALL THE ITEMS TOGETHER AS ONE ARRAY ELEMENT (INDICATED BY BRACKETS). BRACKETS IN THE OUTPUT FILE BREAK THE ECHO COMMAND. EVEN IF IT DIDN'T, THIS IS STILL NOT WHAT WE WANT. WE WANT TO BREAK UP THE PACKAGE FOR THE GREETINGS TO BE USED AS SEPARATE INPUT ITEMS. ### 5.6. Add `flatten()` operator [TODO] [TODO] REMEMBER FLATTEN? WE LOVE FLATTEN To apply the operator, add it to the channel construction instruction. Include another view() call.
_Before:_ ```groovy title="hello-world.nf" linenums="46" // create a channel for inputs from a CSV file -greeting_ch = Channel.fromPath(params.greeting) + .view{ "Before splitCsv: $it" } .splitCsv() - .view{ "After splitCsv: $it" } ``` _After:_ ```groovy title="hello-world.nf" linenums="46" // create a channel for inputs from a CSV file +greeting_ch = Channel.fromPath(params.greeting) + .view{ "Before splitCsv: $it" } .splitCsv() .view{ "After splitCsv: $it" } .flatten() .view{ "After flatten: $it" } ``` ### 5.7. Run it [TODO] [TODO] THIS TIME IT WORKS, YAY Looking at the outputs, we see each greeting was correctly extracted and processed through the workflow. We've achieved the same result as previously, but now we have a lot more flexibility to add more elements to the channel of greetings we want to process without modifying any code. [TODO] NOTE THAT IF YOU ADD MORE LINES TO THE CSV EVERYTHING GETS PARSED AS INDIVIDUAL ITEMS. THAT'S WHAT FLATTEN IS DOING. LEARN MORE ABOUT PLUMBING LATER. !!! note Be sure to remove the `.view()` operations before you continue. ```groovy title="hello-world.nf" linenums="46" // create a channel for inputs from a CSV file greeting_ch = Channel.fromPath(params.greeting) .splitCsv() .flatten() ``` ### Takeaway You know how to use the splitCsv() and flatten() operators to handle a batch of values passed in through a file. More generally, [TODO] ### What's next? Take a break! Don't worry if the channel factories and operators feel like a lot to grapple with the first time you encounter them. You'll get more opportunities to practice using these components in various settings as you work through this training course.
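If it helps to see what these two operators do before moving on, here is a plain Python sketch of the transformation (an illustrative model only — this is not how Nextflow is implemented). After `splitCsv()`, the channel holds one array per CSV line, which is why the `view()` output shows brackets; `flatten()` then unpacks each array into individual items.

```python
# Model a Nextflow channel as a plain list of items flowing through operators.
csv_text = "Hello,Bonjour,Holà\n"

# splitCsv(): each line of the CSV becomes one item, and that item is an
# array of the line's fields -- note the nested brackets in the result.
after_splitcsv = [line.split(",") for line in csv_text.strip().splitlines()]
print(after_splitcsv)   # [['Hello', 'Bonjour', 'Holà']]

# flatten(): unpack every array so each field becomes its own item,
# ready to be fed to one process call apiece.
after_flatten = [field for row in after_splitcsv for field in row]
print(after_flatten)    # ['Hello', 'Bonjour', 'Holà']
```

This mirrors the `view()` statements above: "After splitCsv" prints a single bracketed array, while "After flatten" prints one greeting at a time.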
diff --git a/docs/hello_nextflow/03_hello_plumbing.md b/docs/hello_nextflow/03_hello_plumbing.md index eea4fda2..75eef64a 100644 --- a/docs/hello_nextflow/03_hello_plumbing.md +++ b/docs/hello_nextflow/03_hello_plumbing.md @@ -1,713 +1,21 @@ -# Part 1: Hello World +# Part 3: Hello Plumbing -A "Hello World!" is a minimalist example that is meant to demonstrate the basic syntax and structure of a programming language or software framework. The example typically consists of printing the phrase "Hello, World!" to the output device, such as the console or terminal, or writing it to a file. - -In this first part of the Hello Nextflow training course, we ease into the topic with a very simple domain-agnostic Hello World example, which we'll progressively build up to demonstrate the usage of foundational Nextflow logic and components. - ---- - -## 0. Warmup: Run Hello World directly - -Let's demonstrate this with a simple command that we run directly in the terminal, to show what it does before we wrap it in Nextflow. - -### 0.1. Make the terminal say hello - -```bash -echo 'Hello World!' -``` - -### 0.2. Now make it write the text output to a file - -```bash -echo 'Hello World!' > output.txt -``` - -### 0.3. Verify that the output file is there using the `ls` command - -```bash -ls -``` - -### 0.4. Show the file contents - -```bash -cat output.txt -``` - -!!! tip - - In the Gitpod environment, you can also find the output file in the file explorer, and view its contents by clicking on it. Alternatively, you can use the `code` command to open the file for viewing. - - ```bash - code output.txt - ``` - -### Takeaway - -You now know how to run a simple command in the terminal that outputs some text, and optionally, how to make it write the output to a file. - -### What's next? - -Discover what that would look like written as a Nextflow workflow. - ---- - -## 1. 
Try the Hello World workflow starter script - -As mentioned in the orientation, we provide you with a fully functional if minimalist workflow script named `hello-world.nf` that does the same thing as before (write out 'Hello World!') but with Nextflow. - -To get you started, we'll first open up the workflow script so you can get a sense of how it's structured, then we'll run it (before trying to make any modifications) to verify that it does what we expect. - -### 1.1. Decipher the code structure - -Let's open the `hello-world.nf` script in the editor pane. - -!!! note - - The file is in the `hello-nextflow` directory, which should be your current working directory. - You can either click on the file in the file explorer, or type `ls` in the terminal and Cmd+Click (MacOS) or Ctrl+Click (PC) on the file to open it. - -```groovy title="hello-world.nf" linenums="1" -#!/usr/bin/env nextflow - -/* - * Use echo to print 'Hello World!' to standard out - */ -process sayHello { - - output: - stdout - - script: - """ - echo 'Hello World!' - """ -} - -workflow { - - // emit a greeting - sayHello() -} -``` - -As you can see, a Nextflow script involves two main types of core components: one or more **processes**, and the **workflow** itself. -Each **process** describes what operation(s) the corresponding step in the pipeline should accomplish, while the **workflow** describes the dataflow logic that connects the various steps. - -Let's take a closer look at the **process** block first, then we'll look at the **workflow** block. - -#### 1.1.1 The `process` definition - -The first block of code describes a **process**. -The process definition starts with the keyword `process`, followed by the process name and finally the process body delimited by curly braces. -The process body must contain a script block which specifies the command to run, which can be anything you would be able to run in a command line terminal. 
- -Here we have a **process** called `sayHello` that writes its **output** to `stdout`. - -```groovy title="hello-world.nf" linenums="3" -/* - * Use echo to print 'Hello World!' to standard out - */ -process sayHello { - - output: - stdout - - script: - """ - echo 'Hello World!' - """ -} -``` - -This a very minimal process definition that just contains an output definition and the script itself. -In a real-world pipeline, a process usually contains additional blocks such as directives, inputs, and conditional clauses, which we'll introduce later in this training course. - -!!! note - - The output definition does not _determine_ what output will be created. - It simply _declares_ what is the expected output, so that Nextflow can look for it once execution is complete. - This is necessary for verifying that the command was executed successfully and for passing the output to downstream processes if needed. - -#### 1.1.2 The `workflow` definition - -The second block of code describes the **workflow** itself. -The workflow definition starts with the keyword `workflow`, followed by an optional name, then the workflow body delimited by curly braces. - -Here we have a **workflow** that consists of one call to the `sayHello` process. - -```groovy title="hello-world.nf" linenums="16" -workflow { - - // emit a greeting - sayHello() -} -``` - -This a very minimal **workflow** definition. -In a real-world pipeline, the workflow typically contains multiple calls to **processes** connected by **channels**. -You'll learn how to add more processes and connect them by channels in a little bit. - -### 1.2. Run the workflow - -Looking at code is not nearly as fun as running it, so let's try this out in practice. 
- -```bash -nextflow run hello-world.nf -``` - -You console output should look something like this: - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [reverent_carson] DSL2 - revision: 463b611a35 - -executor > local (1) -[1c/7d08e6] sayHello [100%] 1 of 1 ✔ -``` - -Congratulations, you just ran your first Nextflow workflow! - -The most important output here is the last line (line 6), which reports that the `sayHello` process was successfully executed once. - -Okay, that's great, but where do we find the output? -The `sayHello` process definition said that the output would be sent to standard out, but nothing got printed in the console, did it? - -### 1.3. Find the output and logs in the `work` directory - -When you run Nextflow for the first time in a given directory, it creates a directory called `work` where it will write all files (and symlinks) generated in the course of execution. -Have a look inside; you'll find a subdirectory named with a hash (in order to make it unique; we'll discuss why in a bit), nested two levels deep and containing a handful of log files. - -!!! tip - - If you browse the contents of the task subdirectory in the Gitpod's VSCode file explorer, you'll see all these files right away. - However, these files are set to be invisible in the terminal, so if you want to use `ls` or `tree` to view them, you'll need to set the relevant option for displaying invisible files. - - ```bash - tree -a work - ``` - - You should see something like this, though the exact subdirectory names will be different on your system. 
- - ```console title="Directory contents" - work - └── 1c - └── 7d08e685a7aa7060b9c21667924824 - ├── .command.begin - ├── .command.err - ├── .command.log - ├── .command.out - ├── .command.run - ├── .command.sh - └── .exitcode - ``` - -You may have noticed that the subdirectory names appeared (in truncated form) in the output from the workflow run, in the line that says: - -```console title="Output" -[1c/7d08e6] sayHello [100%] 1 of 1 ✔ -``` - -This tells you what is the subdirectory path for that specific process call (sometimes called task). - -!!! note - - Nextflow creates a separate unique subdirectory for each process call. - It stages the relevant input files, script, and other helper files there, and writes any output files and logs there as well. - -If we look inside the subdirectory, we find the following log files: - -- **`.command.begin`**: Metadata related to the beginning of the execution of the process task -- **`.command.err`**: Error messages (stderr) emitted by the process task -- **`.command.log`**: Complete log output emitted by the process task -- **`.command.out`**: Regular output (stdout) by the process task -- **`.command.sh`**: The command that was run by the process task call -- **`.exitcode`**: The exit code resulting from the command - -In this case, you can look for your output in the `.command.out` file, since that's where stdout output is captured. -If you open it, you'll find the `Hello World!` greeting, which was the expected result of our minimalist workflow. - -It's also worth having a look at the `.command.sh` file, which tells you what command Nextflow actually executed. In this case it's very straightforward, but later in the course you'll see commands that involve some interpolation of variables. When you're dealing with that, you need to be able to check exactly what was run, especially when troubleshooting an issue. 
- -### Takeaway - -You know how to decipher a simple Nextflow script, run it and find the output and logs in the work directory. - -### What's next? - -Learn how to make the script output a named file. - ---- - -## 3. Send the output to a file - -Instead of printing "Hello World!" to standard output, we'd prefer to save that output to a specific file, just like we did when running in the terminal earlier. -This is how most tools that you'll run as part of real-world pipelines typically behave; we'll see examples of that later. - -To achieve this result, both the script and the output definition blocks need to be updated. - -### 3.1. Change the process command to output a named file - -This is the same change we made when we ran the command directly in the terminal earlier. - -_Before:_ - -```groovy title="hello-world.nf" linenums="11" -""" -echo 'Hello World!' -""" -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="11" -""" -echo 'Hello World!' > output.txt -""" -``` - -### 3.2. Change the output declaration in the `sayHello` process - -We need to tell Nextflow that it should now look for a specific file to be produced by the process execution. - -_Before:_ - -```groovy title="hello-world.nf" linenums="8" -output: - stdout -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="8" -output: - path 'output.txt' -``` - -!!! note - - Inputs and outputs in the process blocks typically require a qualifier and a variable name: - - ``` - - ``` - - The qualifier defines the type of data to be received. - This information is used by Nextflow to apply the semantic rules associated with each qualifier, and handle it properly. - Common qualifiers include `val` and `path`. - In the example above, `stdout` is an exception since it is not associated with a name. - -### 3.3. 
Run the workflow again - -```bash -nextflow run hello-world.nf -``` - -The log output should be very similar to the first time your ran the workflow: - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [cranky_sinoussi] DSL2 - revision: 30b437bb96 - -executor > local (1) -[7a/6bd54c] sayHello [100%] 1 of 1 ✔ -``` - -Like you did before, find the `work` directory in the file explorer. -There, find the `output.txt` output file and click on it to open it, and verify that it contains the greeting as expected. - -!!! warning - - This example is brittle because we hardcoded the output filename in two separate places (the script and the output blocks). - If we change one but not the other, the script will break. - Later, you'll learn how to use variables to avoid this problem. - -### 3.4. Add a `publishDir` directive to the process - -You'll have noticed that the output is buried in a working directory several layers deep. -Nextflow is in control of this directory and we are not supposed to interact with it. -To make the output file more accessible, we can utilize the `publishDir` directive. -By specifying this directive, we are telling Nextflow to automatically copy the output file to a designated output directory. -This allows us to leave the working directory alone, while still having easy access to the desired output file. - -_Before:_ - -```groovy title="hello-world.nf" linenums="6" -process sayHello { - - output: - path 'output.txt' -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="6" -process sayHello { - - publishDir 'results', mode: 'copy' - - output: - path 'output.txt' -``` - -!!! note - - There is a newer syntax option that makes it possible to declare and publish workflow-level outputs, documented [here](https://www.nextflow.io/docs/latest/workflow.html#publishing-outputs), which makes using `publishDir` at the process level redundant once your pipeline is fully operational. 
- However, `publishDir` is still very useful during pipeline development; that is why we include it in this training series. - This will also ensure that you can read and understand the large number of pipelines that have already been written with `publishDir`. - - You'll learn how to use the workflow-level outputs syntax later in this training series. - -### 3.5. Run the workflow again - -```bash -nextflow run hello-world.nf -``` - -The log output should start looking very familiar: - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [mighty_lovelace] DSL2 - revision: 6654bc1327 - -executor > local (1) -[10/15498d] sayHello [100%] 1 of 1 ✔ -``` - -This time, Nextflow will have created a new directory called `results/`. -In this directory is our `output.txt` file. -If you check the contents it should match the output in our work/task directory. -This is how we move results files outside of the working directories. - -### Takeaway - -You know how to send outputs to a specific named file and use the `publishDir` directive to move files outside of the Nextflow working directory. - -### What's next? - -Learn how to make Nextflow resume running a pipeline using cached results from a prior run to skip any steps it had already completed successfully. - ---- - -## 4. Use the Nextflow resume feature - -Nextflow has an option called `-resume` that allows you to re-run a pipeline you've already launched previously. -When launched with `-resume` any processes that have already been run with the exact same code, settings and inputs will be skipped. -Using this mode means Nextflow will only run processes that are either new, have been modified or are being provided new settings or inputs. - -There are two key advantages to doing this: - -- If you're in the middle of developing your pipeline, you can iterate more rapidly since you only effectively have to run the process(es) you're actively working on in order to test your changes. 
-- If you're running a pipeline in production and something goes wrong, in many cases you can fix the issue and relaunch the pipeline, and it will resume running from the point of failure, which can save you a lot of time and compute. - -### 4.1. Run the workflow again with `-resume` - -```bash -nextflow run hello-world.nf -resume -``` - -The console output should look similar. - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [thirsty_gautier] DSL2 - revision: 6654bc1327 - -[10/15498d] sayHello [100%] 1 of 1, cached: 1 ✔ -``` - -Notice the additional `cached:` bit in the process status line, which means that Nextflow has recognized that it has already done this work and simply re-used the result from the last run. - -!!! note - - When your re-run a pipeline with `resume`, Nextflow does not overwrite any files written to a publishDir directory by any process call that was previously run successfully. - -### Takeaway - -You know how to to relaunch a pipeline without repeating steps that were already run in an identical way. - -### What's next? - -Learn how to add in variable inputs. - ---- - -## 5. Add in variable inputs using a channel - -So far, we've been emitting a greeting hardcoded into the process command. -Now we're going to add some flexibility by using an input variable, so that we can easily change the greeting. - -This requires us to make a series of inter-related changes: - -1. Tell the process about expected variable inputs using the `input:` block -2. Edit the process to use the input -3. Create a **channel** to pass input to the process (more on that in a minute) -4. Add the channel as input to the process call - -### 5.1. Add an input definition to the process block - -First we need to adapt the process definition to accept an input. 
- -_Before:_ - -```groovy title="hello-world.nf" linenums="6" -process sayHello { - - publishDir 'results', mode: 'copy' - - output: - path "output.txt" -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="6" -process sayHello { - - publishDir 'results', mode: 'copy' - - input: - val greeting - - output: - path "output.txt" -``` - -### 5.2. Edit the process command to use the input variable - -Now we swap the original hardcoded value for the input variable. - -_Before:_ - -```groovy title="hello-world.nf" linenums="16" -""" -echo 'Hello World!' > output.txt -""" -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="16" -""" -echo '$greeting' > output.txt -""" -``` - -### 5.3. Create an input channel - -Now that our process expects an input, we need to set up that input in the workflow body. -This is where channels come in: Nextflow uses channels to feed inputs to processes and ferry data between processes that are connected together. - -There are multiple ways to do this, but for now, we're just going to use the simplest possible channel, containing a single value. - -We're going to create the channel using the `of()` channel factory, which sets up a simple value channel, and give it a hardcoded string to use as greeting by declaring `greeting_ch = Channel.of('Hello world!')`. - -_Before:_ - -```groovy title="hello-world.nf" linenums="21" -workflow { - - // emit a greeting - sayHello() -} -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="21" -workflow { - - // create a channel for inputs - greeting_ch = Channel.of('Hello world!') - - // emit a greeting - sayHello() -} -``` - -### 5.4. Add the channel as input to the process call - -Now we need to actually plug our newly created channel into the `sayHello()` process call. 
- -_Before:_ - -```groovy title="hello-world.nf" linenums="26" -// emit a greeting -sayHello() -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="26" -// emit a greeting -sayHello(greeting_ch) -``` - -### 5.5. Run the workflow command again - -Let's run it! - -```bash -nextflow run hello-world.nf -``` - -If you made all four edits correctly, you should get another successful execution: - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [prickly_avogadro] DSL2 - revision: b58b6ab94b - -executor > local (1) -[1f/50efd5] sayHello (1) [100%] 1 of 1 ✔ -``` - -Feel free to check the results directory to satisfy yourself that the outcome is still the same as previously; so far we're just progressively tweaking the internal plumbing to increase the flexibility of our workflow while achieving the same end result. - -### Takeaway - -You know how to use a simple channel to provide an input to a process. - -### What's next? - -Learn how to pass inputs from the command line. - ---- - -## 6. Use CLI parameters for inputs - -We want to be able to specify the input from the command line, since that is the piece that will almost always be different in subsequent runs of the workflow. -Good news: Nextflow has a built-in workflow parameter system called `params`, which makes it easy to declare and use CLI parameters. - -### 6.1. Edit the input channel declaration to use a parameter - -Here we swap out the hardcoded string for `params.greeting` in the channel creation line. - -_Before:_ - -```groovy title="hello-world.nf" linenums="23" -// create a channel for inputs -greeting_ch = Channel.of('Hello world!') -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="23" -// create a channel for inputs -greeting_ch = Channel.of(params.greeting) -``` - -This automatically creates a parameter called `greeting` that you can use to provide a value in the command line. - -### 6.2. 
Run the workflow again with the `--greeting` parameter - -To provide a value for this parameter, simply add `--greeting ` to your command line. - -```bash -nextflow run hello-world.nf --greeting 'Bonjour le monde!' -``` - -Running this should feel extremely familiar by now. - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [cheesy_engelbart] DSL2 - revision: b58b6ab94b - -executor > local (1) -[1c/9b6dc9] sayHello (1) [100%] 1 of 1 ✔ -``` - -Be sure to open up the output file to check that you now have the new version of the greeting. Voilà! - -!!! tip - - It's helpful to distinguish Nextflow-level parameters from pipeline-level parameters. - For parameters that apply to a pipeline, we use a double hyphen (`--`), whereas we use a single hyphen (`-`) for parameters that modify a specific Nextflow setting, _e.g._ the `-resume` feature we used earlier. - -### 6.3. Set a default value for a command line parameter - -In many cases, it makes sense to supply a default value for a given parameter so that you don't have to specify it for every run. - -Let's initialize the `greeting` parameter with a default value by adding the parameter declaration at the top of the script (with a comment block as a free bonus). - -```groovy title="hello-world.nf" linenums="3" -/* - * Pipeline parameters - */ -params.greeting = "Holà mundo!" -``` - -### 6.4. Run the workflow again without specifying the parameter - -Now that you have a default value set, you can run the workflow again without having to specify a value in the command line. - -```bash -nextflow run hello-world.nf -``` - -The output should look the same. - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [wise_waddington] DSL2 - revision: 988fc779cf - -executor > local (1) -[c0/8b8332] sayHello (1) [100%] 1 of 1 ✔ -``` - -Check the output in the results directory, and... Tadaa! It works! 
Nextflow used the default value to name the output. But wait, what happens now if we provide the parameter in the command line? - -### 6.5. Run the workflow again with the `--greeting` parameter on the command line using a different greeting - -```bash -nextflow run hello-world.nf --greeting 'Konnichiwa!' -``` - -Nextflow's not complaining, that's a good sign: - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [prickly_miescher] DSL2 - revision: 988fc779cf - -executor > local (1) -[56/f88a56] sayHello (1) [100%] 1 of 1 ✔ -``` - -Check the results directory and look at the contents of `output.txt`. Tadaa again! - -The value of the parameter we passed on the command line overrode the value we gave the variable in the script. In fact, parameters can be set in several different ways; if the same parameter is set in multiple places, its value is determined based on the order of precedence that is described [here](https://www.nextflow.io/docs/latest/config.html). - -!!! tip - - You can put the parameter declaration inside the workflow block if you prefer. Whatever you choose, try to group similar things in the same place so you don't end up with declarations all over the place. - -### Takeaway - -You know how to set up an input variable for a process and supply a value in the command line. - -### What's next? - -Learn how to add in a second process and chain them together. +[TODO] --- -## 7. Add a second step to the workflow +## 0. Warmup: Run in terminal +[TODO] Most real-world workflows involve more than one step. Here we introduce a second process that converts the text to uppercase (all-caps), using the classic UNIX one-liner: ```bash tr '[a-z]' '[A-Z]' ``` -We're going to run the command by itself in the terminal first to verify that it works as expected without any of the workflow code getting in the way of clarity, just like we did at the start with `echo 'Hello World'`. 
Then we'll write a process that does the same thing, and finally we'll connect the two processes so the output of the first serves as input to the second. +We're going to run the command by itself in the terminal first to verify that it works as expected without any of the workflow code getting in the way of clarity, just like we did at the start with `echo 'Hello World'`. -### 7.1. Run the command in the terminal by itself +### 0.1. Run the command in the terminal by itself ```bash echo 'Hello World' | tr '[a-z]' '[A-Z]' @@ -723,7 +31,7 @@ HELLO WORLD This is a very naive text replacement one-liner that does not account for accented letters, so for example 'Holà' will become 'HOLà'. This is expected. -### 7.2. Make the command take a file as input and write the output to a file +### 0.2. Make the command take a file as input and write the output to a file As previously, we want to output results to a dedicated file, which we name by prepending the original filename with `UPPER-`. @@ -733,11 +41,19 @@ cat output.txt | tr '[a-z]' '[A-Z]' > UPPER-output.txt Now the `HELLO WORLD` output is in the new output file, `UPPER-output.txt`. -### 7.3. Wrap the command in a new Nextflow process definition +--- + +## 1. Add a second step to the workflow + +[TODO] INTRO + +We're going to write a process that wraps the command we just ran in the terminal, then we'll add it to the workflow, setting it up to take the output of the `sayHello()` process as input. + +### 1.1. Wrap the command in a new Nextflow process definition We can model our new process on the first one, since we want to use all the same components. -```groovy title="hello-world.nf" linenums="26" +```groovy title="hello-plumbing.nf" linenums="26" /* * Use a text replace utility to convert the greeting to uppercase */ @@ -758,21 +74,23 @@ process convertToUpper { } ``` -As a little bonus, here we composed the second output filename based on the first one. 
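Before wiring this process into the workflow, it can be reassuring to replay the wrapped one-liner by itself in the terminal — a plain shell check outside of Nextflow, which also illustrates the accented-letter caveat noted in the warmup:

```shell
# ASCII letters are translated by the naive [a-z] -> [A-Z] ranges
echo 'Hello World!' | tr '[a-z]' '[A-Z]'

# Accented characters fall outside those ranges and pass through unchanged,
# which is why 'Hola' with a grave accent becomes 'HOLa'-with-accent, not 'HOLA'
echo 'Holà' | tr '[a-z]' '[A-Z]'
```

On a typical Linux system this prints `HELLO WORLD!` followed by `HOLà`, matching the behavior described in the warmup.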
+Similarly to what we did for the output of the first process, we compose the second output filename based on the input filename. !!! tip - Very important to remember: you have to use double quotes around the output filename expression (NOT single quotes) or it will fail. + You MUST use double quotes around the output filename expression (NOT single quotes), otherwise it will fail. -### 7.4. Add a call to the new process in the workflow body +### 1.2. Add a call to the new process in the workflow body Don't forget we need to tell Nextflow to actually call the process we just created! To do that, we add it to the `workflow` body. -```groovy title="hello-world.nf" linenums="44" +```groovy title="hello-plumbing.nf" linenums="44" workflow { - // create a channel for inputs - greeting_ch = Channel.of(params.greeting) + // create a channel for inputs from a CSV file + greeting_ch = Channel.fromPath('data/greetings.csv') + .splitCsv() + .flatten() // emit a greeting sayHello(greeting_ch) @@ -782,25 +100,25 @@ workflow { } ``` -Looking good! But we still need to wire up the `convertToUpper` process call to run on the output of `sayHello`. +Looking good! Now we just need to wire up the `convertToUpper` process call to run on the output of `sayHello`. -### 7.5. Pass the output of the first process to the second process +### 1.3. Pass the output of the first process to the second process -The output of the `sayHello` process is automatically packaged as a channel called `sayHello.out`, so all we need to do is pass that as the input to the `convertToUpper` process. +The output of the `sayHello` process is automatically provided as a channel called `sayHello.out`, so all we need to do is pass that as the input to the `convertToUpper` process. 
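This plumbing mirrors what we did by hand in the warmup: the file written by the first step becomes the input of the second. Stripped of all the Nextflow machinery, the dataflow we are wiring up is equivalent to this shell sketch (file names are illustrative):

```shell
# Step 1: what a sayHello task does for the greeting 'Hello'
echo 'Hello' > Hello-output.txt

# Step 2: what the downstream convertToUpper task does with that file
cat Hello-output.txt | tr '[a-z]' '[A-Z]' > UPPER-Hello-output.txt

cat UPPER-Hello-output.txt
```

The difference is that Nextflow stages these files in isolated task directories and hands them from one process to the next for us.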
-```groovy title="hello-world.nf" linenums="52" +```groovy title="hello-plumbing.nf" linenums="52" // convert the greeting to uppercase convertToUpper(sayHello.out) ``` For a simple case like this, that's all we need to do to connect two processes! -### 7.6. Run the same workflow command as before +### 1.4. Run the same workflow command as before Let's make sure this works: ```bash -nextflow run hello-world.nf --greeting 'Hello World!' +nextflow run hello-plumbing.nf ``` Oh, how exciting! There is now an extra line in the log output, which corresponds to the new process we just added: @@ -808,7 +126,7 @@ Oh, how exciting! There is now an extra line in the log output, which correspond ```console title="Output" N E X T F L O W ~ version 24.10.0 - ┃ Launching `hello-world.nf` [magical_brenner] DSL2 - revision: 0e18f34798 + ┃ Launching `hello-plumbing.nf` [magical_brenner] DSL2 - revision: 0e18f34798 executor > local (2) [57/3836c0] sayHello (1) [100%] 1 of 1 ✔ @@ -822,7 +140,7 @@ By default, Nextflow uses symbolic links to stage input files whenever possible, !!! note - All we did was connect the output of `sayHello` to the input of `convertToUpper` and the two processes could be run in serial. + All we did was connect the output of `sayHello` to the input of `convertToUpper` and the two processes could be run serially. Nextflow did the hard work of handling input and output files and passing them between the two commands for us. This is the power of channels in Nextflow, doing the busywork of connecting our pipeline steps together. @@ -835,352 +153,10 @@ You know how to add a second step that takes the output of the first step as inp ### What's next? -Learn how to make the workflow run on a batch of input values. - ---- - -## 8. Modify the workflow to run on a batch of input values - -Workflows typically run on batches of inputs that are meant to be processed in bulk, so we want to upgrade the workflow to accept multiple input values. 
- -Conveniently, the `of()` channel factory we've been using is quite happy to accept more than one value, so we don't need to modify that at all; we just have to load more values into the channel. - -### 8.1. Load multiple greetings into the input channel - -To keep things simple, we go back to hardcoding the greetings in the channel factory instead of using a parameter for the input, but we'll improve on that shortly. - -_Before:_ - -```groovy title="hello-world.nf" linenums="46" -// create a channel for inputs -greeting_ch = Channel.of(params.greeting) -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="46" -// create a channel for inputs -greeting_ch = Channel.of('Hello','Bonjour','Holà') -``` - -The documentation tells us this should work. Can it really be so simple? - -### 8.2. Run the command and look at the log output - -Let's try it. - -```bash -nextflow run hello-world.nf -``` - -Well, it certainly seems to run just fine. - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [lonely_pare] DSL2 - revision: b9f1d96905 - -executor > local (6) -[3d/1fe62c] sayHello (2) [100%] 3 of 3 ✔ -[86/695813] convertToUpper (3) [100%] 3 of 3 ✔ -``` - -However... This seems to indicate that '3 of 3' calls were made for each process, which is encouraging, but this only give us one subdirectory path for each. What's going on? - -By default, the ANSI logging system writes the logging from multiple calls to the same process on the same line. Fortunately, we can disable that behavior. - -### 8.3. Run the command again with the `-ansi-log false` option - -To expand the logging to display one line per process call, just add `-ansi-log false` to the command. 
- -```bash -nextflow run hello-world.nf -ansi-log false -``` - -This time we see all six work subdirectories listed in the output: - -```console title="Output" -N E X T F L O W ~ version 24.02.0-edge -Launching `hello-world.nf` [big_woese] DSL2 - revision: 53f20aeb70 -[62/d81e63] Submitted process > sayHello (1) -[19/507af3] Submitted process > sayHello (2) -[8a/3126e6] Submitted process > sayHello (3) -[12/48a5c6] Submitted process > convertToUpper (1) -[73/e6e746] Submitted process > convertToUpper (2) -[c5/4fedda] Submitted process > convertToUpper (3) -``` - -That's much better; at least for this number of processes. -For a complex workflow, or a large number of inputs, having the full list output to the terminal might get a bit overwhelming. - -That being said, we have another problem. If you look in the `results` directory, there are only two files: `output.txt` and `UPPER-output.txt`! - -```console title="Directory contents" -results -├── output.txt -└── UPPER-output.txt -``` - -What's up with that? Shouldn't we be expecting two files per input greeting, so six files in all? - -You may recall that we hardcoded the output file name for the first process. -This was fine as long as there was only a single call made per process, but when we start processing multiple input values and publishing the outputs into the same directory of results, it becomes a problem. -For a given process, every call produces an output with the same file name, so Nextflow just overwrites the previous output file every time a new one is produced. - -### 8.4. Ensure the output file names will be unique - -Since we're going to be publishing all the outputs to the same results directory, we need to ensure they will have unique names. -Specifically, we need to modify the first process to generate a file name dynamically so that the final file names will be unique. - -So how do we make the file names unique? 
A common way to do that is to use some unique piece of metadata as part of the file name. -Here, for convenience, we'll just use the greeting itself. - -_Before:_ - -```groovy title="hello-world.nf" linenums="11" -process sayHello { - - publishDir 'results', mode: 'copy' - - input: - val greeting - - output: - path "output.txt" - - script: - """ - echo '$greeting' > "output.txt" - """ -} -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="11" -process sayHello { - - publishDir 'results', mode: 'copy' - - input: - val greeting - - output: - path "${greeting}-output.txt" - - script: - """ - echo '$greeting' > '$greeting-output.txt' - """ -} -``` - -This should produce a unique output file name for every call of each process. - -### 8.5. Run the workflow and look at the results directory - -Let's run it and check that it works. - -```bash -nextflow run hello-world.nf -``` - -Reverting back to the summary view, the output looks like this again: - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [jovial_mccarthy] DSL2 - revision: 53f20aeb70 - -executor > local (6) -[03/f007f2] sayHello (1) [100%] 3 of 3 ✔ -[e5/dd2890] convertToUpper (3) [100%] 3 of 3 ✔ -``` - -But more importantly, now we have six new files in addition to the two we already had in the `results` directory: - -```console title="Directory contents" -results -├── Bonjour-output.txt -├── Hello-output.txt -├── Holà-output.txt -├── output.txt -├── UPPER-Bonjour-output.txt -├── UPPER-Hello-output.txt -├── UPPER-Holà-output.txt -└── UPPER-output.txt -``` - -Success! Now we can add as many greetings as we like without worrying about output files being overwritten. - -!!! note - - In practice, naming files based on the input data itself is almost always impractical. 
The better way to generate dynamic filenames is to use a samplesheet contain relevant metadata (such as unique sample IDs) and create a data structure called a 'map', which we pass to processes, and from which we can grab an appropriate identifier to generate the filenames. - We'll show you how to do that later in this training course. - -### Takeaway - -You know how to feed a batch of multiple input elements through a channel. - -### What's next? - -Learn how to make the workflow take a file as its source of input values. +[TODO] --- -## 9. Modify the workflow to take a file as its source of input values - -It's often the case that, when we want to run on a batch of multiple input elements, the input values are contained in a file. -As an example, we have provided you with a CSV file called `greetings.csv` in the `data/` directory, containing several greetings separated by commas. - -```csv title="greetings.csv" -Hello,Bonjour,Holà -``` - -So we just need to modify our workflow to read in the values from a file like that. - -### 9.1. Set up a CLI parameter with a default value pointing to an input file - -First, let's use the `params` system to set up a new parameter called `input_file`, replacing the now useless `greeting` parameter, with a default value pointing to the `greetings.csv` file. - -_Before:_ - -```groovy title="hello-world.nf" linenums="6" -/* - * Pipeline parameters - */ -params.greeting = "Holà mundo!" -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="6" -/* - * Pipeline parameters - */ -params.input_file = "data/greetings.csv" -``` - -### 9.2. Update the channel declaration to handle the input file - -At this point we introduce a new channel factory, `fromPath()`, which has some built-in functionality for handling file paths. 
-We're going to use that instead of the `of()` channel factory we used previously; the base syntax looks like this: - -```groovy title="channel construction syntax" -Channel.fromPath(params.input_file) -``` - -Now, we are going to deploy a new concept, an 'operator' to transform that CSV file into channel content. You'll learn more about operators later, but for now just understand them as ways of transforming channels in a variety of ways. - -Since our goal is to read in the contents of a `.csv` file, we're going to add the `.splitCsv()` operator to make Nextflow parse the file contents accordingly, as well as the `.flatten()` operator to turn the array element produced by `.splitCsv()` into a channel of individual elements. - -So the channel construction instruction becomes: - -```groovy title="channel construction syntax" -Channel.fromPath(params.input_file) - .splitCsv() - .flatten() -``` - -And here it is in the context of the workflow body: - -_Before:_ - -```groovy title="hello-world.nf" linenums="46" -// create a channel for inputs -greeting_ch = Channel.of('Hello','Bonjour','Holà') -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="46" -// create a channel for inputs from a CSV file -greeting_ch = Channel.fromPath(params.input_file) - .splitCsv() - .flatten() -``` - -If you want to see the impact of `.flatten()`, we can make use of `.view()`, another operator, to demonstrate. 
Edit that section of code so it looks like: - -```groovy title="flatten usage" -// create a channel for inputs from a CSV file -greeting_ch = Channel.fromPath(params.input_file) - .splitCsv() - .view{ "After splitCsv: $it" } - .flatten() - .view{ "After flatten: $it" } -``` - -When you run this updated workflow, you'll see the difference: - -```console title="view output with and without flatten" -After splitCsv: [Hello, Bonjour, Holà] -After flatten: Hello -After flatten: Bonjour -After flatten: Holà -[d3/1a6e23] Submitted process > sayHello (3) -[8f/d9e431] Submitted process > sayHello (1) -[e7/a088af] Submitted process > sayHello (2) -[1a/776e2e] Submitted process > convertToUpper (1) -[83/fb8eba] Submitted process > convertToUpper (2) -[ee/280f93] Submitted process > convertToUpper (3) -``` - -As you can see, the `flatten()` operator has transformed the channel from containing arrays to containing individual elements. This can be useful when you want to process each item separately in your workflow. - -Remove the `.view()` operations before you continue. - -!!! tip - - While you're developing your pipeline, you can inspect the contents of any channel by adding the `.view()` operator to the name of the channel. - For example, if you add `greeting_ch.view()` anywhere in the workflow body, when you run the script, Nextflow will print the channel contents to standard out. - - You can also use this to inspect the effect of the operators. - For example, the output of `Channel.fromPath(params.input_file).splitCsv().view()` will look like this: - - ```console title="Output" - [Hello, Bonjour, Holà] - ``` - - While the output of `Channel.fromPath(params.input_file).splitCsv().flatten().view()` will look like this: - - ```console title="Output" - Hello - Bonjour - Holà - ``` - -### 9.3. Run the workflow (one last time!) 
- -```bash -nextflow run hello-world.nf -``` - -Once again we see each process get executed three times: - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [angry_spence] DSL2 - revision: d171cc0193 - -executor > local (6) -[0e/ceb175] sayHello (2) [100%] 3 of 3 ✔ -[01/046714] convertToUpper (3) [100%] 3 of 3 ✔ -``` - -Looking at the outputs, we see each greeting was correctly extracted and processed through the workflow. We've achieved the same result as the previous step, but now we have a lot more flexibility to add more elements to the channel of greetings we want to process. - -### Takeaway - -You know how to provide the input values to the workflow via a file. - -More generally, you've learned how to use the essential components of Nextflow and you have a basic grasp of the logic of how to build a workflow and manage inputs and outputs. - -### What's next? - -Celebrate your success and take a break! - -Don't worry if the channel types and operators feel like a lot to grapple with the first time you encounter them. -You'll get more opportunities to practice using these components in various settings as you work through this training course. +## 2. [TODO] ADD A COLLECT STEP? -When you're ready, move on to Part 2 to learn about another important concept: provisioning the software required for each process. +## 3. [TODO] MAKE THE COLLECT OPTIONAL? diff --git a/docs/hello_nextflow/04_hello_modules.md b/docs/hello_nextflow/04_hello_modules.md new file mode 100644 index 00000000..a665449f --- /dev/null +++ b/docs/hello_nextflow/04_hello_modules.md @@ -0,0 +1,169 @@ +# Part 4: Hello Modules + +This section covers how to organize your workflow code to make development and maintenance of your pipeline more efficient and sustainable. +Specifically, we are going to demonstrate how to use **modules**. + +In Nextflow, a **module** is a single process definition that is encapsulated by itself in a standalone code file. 
+To use a module in a workflow, you just add a single-line import statement to your workflow code file; then you can integrate the process into the workflow the same way you normally would.
+
+When we started developing our workflow, we put everything in one single code file.
+
+Putting processes into individual modules makes it possible to reuse process definitions in multiple workflows without producing multiple copies of the code.
+This makes the code more shareable, flexible and maintainable.
+
+!!! note
+
+    It is also possible to encapsulate a section of a workflow as a 'subworkflow' that can be imported into a larger pipeline, but that is outside the scope of this training.
+
+---
+
+## 0. Warmup [TODO]
+
+[TODO] Run hello-modules to verify that it works
+
+---
+
+## 1. Create a module for the `sayHello()` process
+
+[TODO]
+
+### 1.1. Create a directory to store modules
+
+### 1.2. Create a file stub for the new module
+
+Let's create an empty file for the module called `sayHello.nf`.
+
+```bash
+touch modules/local/sayHello.nf
+```
+
+This gives us a place to put the process code.
+
+### 1.3. Move the `sayHello` process code to the module file
+
+Copy the whole process definition over from the workflow file to the module file, making sure to copy over the `#!/usr/bin/env nextflow` shebang too.
+
+```groovy title="modules/local/sayHello.nf" linenums="1"
+#!/usr/bin/env nextflow
+
+[TODO]
+```
+
+Once that is done, delete the process definition from the workflow file, but make sure to leave the shebang in place.
+
+### 1.4. Add an import declaration before the workflow block
+
+The syntax for importing a local module is fairly straightforward:
+
+```groovy title="Import declaration syntax"
+include { <MODULE_NAME> } from '<path_to_module_file>'
+```
+
+Let's insert that above the workflow block and fill it out appropriately.
+
+_Before:_
+
+```groovy title="hello-modules.nf" linenums="73"
+workflow {
+```
+
+_After:_
+
+```groovy title="hello-modules.nf" linenums="73"
+// Include modules
+include { sayHello } from './modules/local/sayHello.nf'
+
+workflow {
+```
+
+### 1.5. Run the workflow to verify that it does the same thing as before
+
+We're running the workflow with essentially the same code and inputs as before, so let's add the `-resume` flag and see what happens.
+
+```bash
+nextflow run hello-modules.nf -resume
+```
+
+Sure enough, Nextflow recognizes that it's still all the same work to be done, even if the code is split up into multiple files.
+
+```console title="Output"
+[TODO]
+```
+
+So modularizing the code in the course of development does not break resumability!
+
+### Takeaway
+
+You know how to extract a process into a local module.
+
+### What's next?
+
+Practice making more modules.
+
+---
+
+## 2. Repeat the procedure for the remaining processes
+
+Once you've done one, you can do a million modules...
+But let's just do two more for now.
+
+### 2.1. Create directories to house the code for the two remaining modules
+
+[TODO]
+
+### 2.2. Add import declarations to the workflow `hello-modules.nf` file
+
+Now all that remains is to add the import statements:
+
+[TODO]
+
+_Before:_
+
+```groovy title="hello-modules.nf" linenums="3"
+// Include modules
+include { sayHello } from './modules/local/sayHello.nf'
+
+workflow {
+```
+
+_After:_
+
+```groovy title="hello-modules.nf" linenums="3"
+// Include modules
+include { sayHello } from './modules/local/sayHello.nf'
+include { convertToUpper } from './modules/local/convertToUpper.nf'
+include { collectGreetings } from './modules/local/collectGreetings.nf'
+
+workflow {
+```
+
+### 2.3. Run the workflow to verify that everything still works as expected
+
+Look at that short `hello-modules.nf` file! Let's run it one last time.
+
+ +```bash +nextflow run hello-modules.nf -resume +``` + +Yep, everything still works, including the resumability of the pipeline. + +```console title="Output" +[TODO] +``` + +Congratulations, you've done all this work and absolutely nothing has changed to how the pipeline works! + +Jokes aside, now your code is more modular, and if you decide to write another pipeline that calls on one of those processes, you just need to type one short import statement to use the relevant module. +This is better than just copy-pasting the code, because if later you decide to improve the module, all your pipelines will inherit the improvements. + +### Takeaway + +You know how to modularize multiple processes in a workflow. + +### What's next? + +Learn to manage inputs and parameters with more flexibility and convenience. + +--- + +## 3. [TODO] Subworkflow? diff --git a/docs/hello_nextflow/04_hello_params.md b/docs/hello_nextflow/04_hello_params.md deleted file mode 100644 index eea4fda2..00000000 --- a/docs/hello_nextflow/04_hello_params.md +++ /dev/null @@ -1,1186 +0,0 @@ -# Part 1: Hello World - -A "Hello World!" is a minimalist example that is meant to demonstrate the basic syntax and structure of a programming language or software framework. The example typically consists of printing the phrase "Hello, World!" to the output device, such as the console or terminal, or writing it to a file. - -In this first part of the Hello Nextflow training course, we ease into the topic with a very simple domain-agnostic Hello World example, which we'll progressively build up to demonstrate the usage of foundational Nextflow logic and components. - ---- - -## 0. Warmup: Run Hello World directly - -Let's demonstrate this with a simple command that we run directly in the terminal, to show what it does before we wrap it in Nextflow. - -### 0.1. Make the terminal say hello - -```bash -echo 'Hello World!' -``` - -### 0.2. Now make it write the text output to a file - -```bash -echo 'Hello World!' 
> output.txt -``` - -### 0.3. Verify that the output file is there using the `ls` command - -```bash -ls -``` - -### 0.4. Show the file contents - -```bash -cat output.txt -``` - -!!! tip - - In the Gitpod environment, you can also find the output file in the file explorer, and view its contents by clicking on it. Alternatively, you can use the `code` command to open the file for viewing. - - ```bash - code output.txt - ``` - -### Takeaway - -You now know how to run a simple command in the terminal that outputs some text, and optionally, how to make it write the output to a file. - -### What's next? - -Discover what that would look like written as a Nextflow workflow. - ---- - -## 1. Try the Hello World workflow starter script - -As mentioned in the orientation, we provide you with a fully functional if minimalist workflow script named `hello-world.nf` that does the same thing as before (write out 'Hello World!') but with Nextflow. - -To get you started, we'll first open up the workflow script so you can get a sense of how it's structured, then we'll run it (before trying to make any modifications) to verify that it does what we expect. - -### 1.1. Decipher the code structure - -Let's open the `hello-world.nf` script in the editor pane. - -!!! note - - The file is in the `hello-nextflow` directory, which should be your current working directory. - You can either click on the file in the file explorer, or type `ls` in the terminal and Cmd+Click (MacOS) or Ctrl+Click (PC) on the file to open it. - -```groovy title="hello-world.nf" linenums="1" -#!/usr/bin/env nextflow - -/* - * Use echo to print 'Hello World!' to standard out - */ -process sayHello { - - output: - stdout - - script: - """ - echo 'Hello World!' - """ -} - -workflow { - - // emit a greeting - sayHello() -} -``` - -As you can see, a Nextflow script involves two main types of core components: one or more **processes**, and the **workflow** itself. 
-Each **process** describes what operation(s) the corresponding step in the pipeline should accomplish, while the **workflow** describes the dataflow logic that connects the various steps.
-
-Let's take a closer look at the **process** block first, then we'll look at the **workflow** block.
-
-#### 1.1.1 The `process` definition
-
-The first block of code describes a **process**.
-The process definition starts with the keyword `process`, followed by the process name and finally the process body delimited by curly braces.
-The process body must contain a script block which specifies the command to run, which can be anything you would be able to run in a command line terminal.
-
-Here we have a **process** called `sayHello` that writes its **output** to `stdout`.
-
-```groovy title="hello-world.nf" linenums="3"
-/*
- * Use echo to print 'Hello World!' to standard out
- */
-process sayHello {
-
-    output:
-        stdout
-
-    script:
-    """
-    echo 'Hello World!'
-    """
-}
-```
-
-This is a very minimal process definition that just contains an output definition and the script itself.
-In a real-world pipeline, a process usually contains additional blocks such as directives, inputs, and conditional clauses, which we'll introduce later in this training course.
-
-!!! note
-
-    The output definition does not _determine_ what output will be created.
-    It simply _declares_ the expected output, so that Nextflow can look for it once execution is complete.
-    This is necessary for verifying that the command was executed successfully and for passing the output to downstream processes if needed.
-
-#### 1.1.2 The `workflow` definition
-
-The second block of code describes the **workflow** itself.
-The workflow definition starts with the keyword `workflow`, followed by an optional name, then the workflow body delimited by curly braces.
-
-Here we have a **workflow** that consists of one call to the `sayHello` process. 
-
-```groovy title="hello-world.nf" linenums="16"
-workflow {
-
-    // emit a greeting
-    sayHello()
-}
-```
-
-This is a very minimal **workflow** definition.
-In a real-world pipeline, the workflow typically contains multiple calls to **processes** connected by **channels**.
-You'll learn how to add more processes and connect them by channels in a little bit.
-
-### 1.2. Run the workflow
-
-Looking at code is not nearly as fun as running it, so let's try this out in practice.
-
-```bash
-nextflow run hello-world.nf
-```
-
-Your console output should look something like this:
-
-```console title="Output"
- N E X T F L O W ~ version 24.10.0
-
- ┃ Launching `hello-world.nf` [reverent_carson] DSL2 - revision: 463b611a35
-
-executor > local (1)
-[1c/7d08e6] sayHello [100%] 1 of 1 ✔
-```
-
-Congratulations, you just ran your first Nextflow workflow!
-
-The most important output here is the last line (line 6), which reports that the `sayHello` process was successfully executed once.
-
-Okay, that's great, but where do we find the output?
-The `sayHello` process definition said that the output would be sent to standard out, but nothing got printed in the console, did it?
-
-### 1.3. Find the output and logs in the `work` directory
-
-When you run Nextflow for the first time in a given directory, it creates a directory called `work` where it will write all files (and symlinks) generated in the course of execution.
-Have a look inside; you'll find a subdirectory named with a hash (in order to make it unique; we'll discuss why in a bit), nested two levels deep and containing a handful of log files.
-
-!!! tip
-
-    If you browse the contents of the task subdirectory in Gitpod's VSCode file explorer, you'll see all these files right away.
-    However, these files are set to be invisible in the terminal, so if you want to use `ls` or `tree` to view them, you'll need to set the relevant option for displaying invisible files. 
-
-    ```bash
-    tree -a work
-    ```
-
-    You should see something like this, though the exact subdirectory names will be different on your system.
-
-    ```console title="Directory contents"
-    work
-    └── 1c
-        └── 7d08e685a7aa7060b9c21667924824
-            ├── .command.begin
-            ├── .command.err
-            ├── .command.log
-            ├── .command.out
-            ├── .command.run
-            ├── .command.sh
-            └── .exitcode
-    ```
-
-You may have noticed that the subdirectory names appeared (in truncated form) in the output from the workflow run, in the line that says:
-
-```console title="Output"
-[1c/7d08e6] sayHello [100%] 1 of 1 ✔
-```
-
-This tells you the subdirectory path for that specific process call (sometimes called a task).
-
-!!! note

-
-    Nextflow creates a separate unique subdirectory for each process call.
-    It stages the relevant input files, script, and other helper files there, and writes any output files and logs there as well.
-
-If we look inside the subdirectory, we find the following log files:
-
-- **`.command.begin`**: Metadata related to the beginning of the execution of the process task
-- **`.command.err`**: Error messages (stderr) emitted by the process task
-- **`.command.log`**: Complete log output emitted by the process task
-- **`.command.out`**: Regular output (stdout) emitted by the process task
-- **`.command.sh`**: The command that was run by the process task call
-- **`.exitcode`**: The exit code resulting from the command
-
-In this case, you can look for your output in the `.command.out` file, since that's where stdout output is captured.
-If you open it, you'll find the `Hello World!` greeting, which was the expected result of our minimalist workflow.
-
-It's also worth having a look at the `.command.sh` file, which tells you what command Nextflow actually executed. In this case it's very straightforward, but later in the course you'll see commands that involve some interpolation of variables. 
When you're dealing with that, you need to be able to check exactly what was run, especially when troubleshooting an issue.
-
-### Takeaway
-
-You know how to decipher a simple Nextflow script, run it and find the output and logs in the work directory.
-
-### What's next?
-
-Learn how to make the script output a named file.
-
----
-
-## 3. Send the output to a file
-
-Instead of printing "Hello World!" to standard output, we'd prefer to save that output to a specific file, just like we did when running in the terminal earlier.
-This is how most tools that you'll run as part of real-world pipelines typically behave; we'll see examples of that later.
-
-To achieve this result, both the script and the output definition blocks need to be updated.
-
-### 3.1. Change the process command to output a named file
-
-This is the same change we made when we ran the command directly in the terminal earlier.
-
-_Before:_
-
-```groovy title="hello-world.nf" linenums="11"
-"""
-echo 'Hello World!'
-"""
-```
-
-_After:_
-
-```groovy title="hello-world.nf" linenums="11"
-"""
-echo 'Hello World!' > output.txt
-"""
-```
-
-### 3.2. Change the output declaration in the `sayHello` process
-
-We need to tell Nextflow that it should now look for a specific file to be produced by the process execution.
-
-_Before:_
-
-```groovy title="hello-world.nf" linenums="8"
-output:
-    stdout
-```
-
-_After:_
-
-```groovy title="hello-world.nf" linenums="8"
-output:
-    path 'output.txt'
-```
-
-!!! note
-
-    Inputs and outputs in the process blocks typically require a qualifier and a variable name:
-
-    ```
-    <qualifier> <variable name>
-    ```
-
-    The qualifier defines the type of data to be received.
-    This information is used by Nextflow to apply the semantic rules associated with each qualifier, and handle it properly.
-    Common qualifiers include `val` and `path`.
-    In the example above, `stdout` is an exception since it is not associated with a name.
-
-### 3.3. 
Run the workflow again
-
-```bash
-nextflow run hello-world.nf
-```
-
-The log output should be very similar to the first time you ran the workflow:
-
-```console title="Output"
- N E X T F L O W ~ version 24.10.0
-
- ┃ Launching `hello-world.nf` [cranky_sinoussi] DSL2 - revision: 30b437bb96
-
-executor > local (1)
-[7a/6bd54c] sayHello [100%] 1 of 1 ✔
-```
-
-Like you did before, find the `work` directory in the file explorer.
-There, find the `output.txt` output file and click on it to open it, and verify that it contains the greeting as expected.
-
-!!! warning
-
-    This example is brittle because we hardcoded the output filename in two separate places (the script and the output blocks).
-    If we change one but not the other, the script will break.
-    Later, you'll learn how to use variables to avoid this problem.
-
-### 3.4. Add a `publishDir` directive to the process
-
-You'll have noticed that the output is buried in a working directory several layers deep.
-Nextflow is in control of this directory and we are not supposed to interact with it.
-To make the output file more accessible, we can utilize the `publishDir` directive.
-By specifying this directive, we are telling Nextflow to automatically copy the output file to a designated output directory.
-This allows us to leave the working directory alone, while still having easy access to the desired output file.
-
-_Before:_
-
-```groovy title="hello-world.nf" linenums="6"
-process sayHello {
-
-    output:
-        path 'output.txt'
-```
-
-_After:_
-
-```groovy title="hello-world.nf" linenums="6"
-process sayHello {
-
-    publishDir 'results', mode: 'copy'
-
-    output:
-        path 'output.txt'
-```
-
-!!! note
-
-    There is a newer syntax option that makes it possible to declare and publish workflow-level outputs, documented [here](https://www.nextflow.io/docs/latest/workflow.html#publishing-outputs), which makes using `publishDir` at the process level redundant once your pipeline is fully operational. 
- However, `publishDir` is still very useful during pipeline development; that is why we include it in this training series. - This will also ensure that you can read and understand the large number of pipelines that have already been written with `publishDir`. - - You'll learn how to use the workflow-level outputs syntax later in this training series. - -### 3.5. Run the workflow again - -```bash -nextflow run hello-world.nf -``` - -The log output should start looking very familiar: - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [mighty_lovelace] DSL2 - revision: 6654bc1327 - -executor > local (1) -[10/15498d] sayHello [100%] 1 of 1 ✔ -``` - -This time, Nextflow will have created a new directory called `results/`. -In this directory is our `output.txt` file. -If you check the contents it should match the output in our work/task directory. -This is how we move results files outside of the working directories. - -### Takeaway - -You know how to send outputs to a specific named file and use the `publishDir` directive to move files outside of the Nextflow working directory. - -### What's next? - -Learn how to make Nextflow resume running a pipeline using cached results from a prior run to skip any steps it had already completed successfully. - ---- - -## 4. Use the Nextflow resume feature - -Nextflow has an option called `-resume` that allows you to re-run a pipeline you've already launched previously. -When launched with `-resume` any processes that have already been run with the exact same code, settings and inputs will be skipped. -Using this mode means Nextflow will only run processes that are either new, have been modified or are being provided new settings or inputs. - -There are two key advantages to doing this: - -- If you're in the middle of developing your pipeline, you can iterate more rapidly since you only effectively have to run the process(es) you're actively working on in order to test your changes. 
-
-- If you're running a pipeline in production and something goes wrong, in many cases you can fix the issue and relaunch the pipeline, and it will resume running from the point of failure, which can save you a lot of time and compute.
-
-### 4.1. Run the workflow again with `-resume`
-
-```bash
-nextflow run hello-world.nf -resume
-```
-
-The console output should look similar.
-
-```console title="Output"
- N E X T F L O W ~ version 24.10.0
-
- ┃ Launching `hello-world.nf` [thirsty_gautier] DSL2 - revision: 6654bc1327
-
-[10/15498d] sayHello [100%] 1 of 1, cached: 1 ✔
-```
-
-Notice the additional `cached:` bit in the process status line, which means that Nextflow has recognized that it has already done this work and simply re-used the result from the last run.
-
-!!! note
-
-    When you re-run a pipeline with `-resume`, Nextflow does not overwrite any files written to a `publishDir` directory by any process call that was previously run successfully.
-
-### Takeaway
-
-You know how to relaunch a pipeline without repeating steps that were already run in an identical way.
-
-### What's next?
-
-Learn how to add in variable inputs.
-
----
-
-## 5. Add in variable inputs using a channel
-
-So far, we've been emitting a greeting hardcoded into the process command.
-Now we're going to add some flexibility by using an input variable, so that we can easily change the greeting.
-
-This requires us to make a series of inter-related changes:
-
-1. Tell the process about expected variable inputs using the `input:` block
-2. Edit the process to use the input
-3. Create a **channel** to pass input to the process (more on that in a minute)
-4. Add the channel as input to the process call
-
-### 5.1. Add an input definition to the process block
-
-First we need to adapt the process definition to accept an input. 
- -_Before:_ - -```groovy title="hello-world.nf" linenums="6" -process sayHello { - - publishDir 'results', mode: 'copy' - - output: - path "output.txt" -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="6" -process sayHello { - - publishDir 'results', mode: 'copy' - - input: - val greeting - - output: - path "output.txt" -``` - -### 5.2. Edit the process command to use the input variable - -Now we swap the original hardcoded value for the input variable. - -_Before:_ - -```groovy title="hello-world.nf" linenums="16" -""" -echo 'Hello World!' > output.txt -""" -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="16" -""" -echo '$greeting' > output.txt -""" -``` - -### 5.3. Create an input channel - -Now that our process expects an input, we need to set up that input in the workflow body. -This is where channels come in: Nextflow uses channels to feed inputs to processes and ferry data between processes that are connected together. - -There are multiple ways to do this, but for now, we're just going to use the simplest possible channel, containing a single value. - -We're going to create the channel using the `of()` channel factory, which sets up a simple value channel, and give it a hardcoded string to use as greeting by declaring `greeting_ch = Channel.of('Hello world!')`. - -_Before:_ - -```groovy title="hello-world.nf" linenums="21" -workflow { - - // emit a greeting - sayHello() -} -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="21" -workflow { - - // create a channel for inputs - greeting_ch = Channel.of('Hello world!') - - // emit a greeting - sayHello() -} -``` - -### 5.4. Add the channel as input to the process call - -Now we need to actually plug our newly created channel into the `sayHello()` process call. 
- -_Before:_ - -```groovy title="hello-world.nf" linenums="26" -// emit a greeting -sayHello() -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="26" -// emit a greeting -sayHello(greeting_ch) -``` - -### 5.5. Run the workflow command again - -Let's run it! - -```bash -nextflow run hello-world.nf -``` - -If you made all four edits correctly, you should get another successful execution: - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [prickly_avogadro] DSL2 - revision: b58b6ab94b - -executor > local (1) -[1f/50efd5] sayHello (1) [100%] 1 of 1 ✔ -``` - -Feel free to check the results directory to satisfy yourself that the outcome is still the same as previously; so far we're just progressively tweaking the internal plumbing to increase the flexibility of our workflow while achieving the same end result. - -### Takeaway - -You know how to use a simple channel to provide an input to a process. - -### What's next? - -Learn how to pass inputs from the command line. - ---- - -## 6. Use CLI parameters for inputs - -We want to be able to specify the input from the command line, since that is the piece that will almost always be different in subsequent runs of the workflow. -Good news: Nextflow has a built-in workflow parameter system called `params`, which makes it easy to declare and use CLI parameters. - -### 6.1. Edit the input channel declaration to use a parameter - -Here we swap out the hardcoded string for `params.greeting` in the channel creation line. - -_Before:_ - -```groovy title="hello-world.nf" linenums="23" -// create a channel for inputs -greeting_ch = Channel.of('Hello world!') -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="23" -// create a channel for inputs -greeting_ch = Channel.of(params.greeting) -``` - -This automatically creates a parameter called `greeting` that you can use to provide a value in the command line. - -### 6.2. 
Run the workflow again with the `--greeting` parameter
-
-To provide a value for this parameter, simply add `--greeting <value>` to your command line.
-
-```bash
-nextflow run hello-world.nf --greeting 'Bonjour le monde!'
-```
-
-Running this should feel extremely familiar by now.
-
-```console title="Output"
- N E X T F L O W ~ version 24.10.0
-
- ┃ Launching `hello-world.nf` [cheesy_engelbart] DSL2 - revision: b58b6ab94b
-
-executor > local (1)
-[1c/9b6dc9] sayHello (1) [100%] 1 of 1 ✔
-```
-
-Be sure to open up the output file to check that you now have the new version of the greeting. Voilà!
-
-!!! tip
-
-    It's helpful to distinguish Nextflow-level parameters from pipeline-level parameters.
-    For parameters that apply to a pipeline, we use a double hyphen (`--`), whereas we use a single hyphen (`-`) for parameters that modify a specific Nextflow setting, _e.g._ the `-resume` feature we used earlier.
-
-### 6.3. Set a default value for a command line parameter
-
-In many cases, it makes sense to supply a default value for a given parameter so that you don't have to specify it for every run.
-
-Let's initialize the `greeting` parameter with a default value by adding the parameter declaration at the top of the script (with a comment block as a free bonus).
-
-```groovy title="hello-world.nf" linenums="3"
-/*
- * Pipeline parameters
- */
-params.greeting = "Holà mundo!"
-```
-
-### 6.4. Run the workflow again without specifying the parameter
-
-Now that you have a default value set, you can run the workflow again without having to specify a value in the command line.
-
-```bash
-nextflow run hello-world.nf
-```
-
-The output should look the same.
-
-```console title="Output"
- N E X T F L O W ~ version 24.10.0
-
- ┃ Launching `hello-world.nf` [wise_waddington] DSL2 - revision: 988fc779cf
-
-executor > local (1)
-[c0/8b8332] sayHello (1) [100%] 1 of 1 ✔
-```
-
-Check the output in the results directory, and... Tadaa! It works! 
Nextflow used the default value for the greeting. But wait, what happens now if we provide the parameter in the command line?
-
-### 6.5. Run the workflow again with the `--greeting` parameter on the command line using a different greeting
-
-```bash
-nextflow run hello-world.nf --greeting 'Konnichiwa!'
-```
-
-Nextflow's not complaining, that's a good sign:
-
-```console title="Output"
- N E X T F L O W ~ version 24.10.0
-
- ┃ Launching `hello-world.nf` [prickly_miescher] DSL2 - revision: 988fc779cf
-
-executor > local (1)
-[56/f88a56] sayHello (1) [100%] 1 of 1 ✔
-```
-
-Check the results directory and look at the contents of `output.txt`. Tadaa again!
-
-The value of the parameter we passed on the command line overrode the value we gave the variable in the script. In fact, parameters can be set in several different ways; if the same parameter is set in multiple places, its value is determined based on the order of precedence that is described [here](https://www.nextflow.io/docs/latest/config.html).
-
-!!! tip
-
-    You can put the parameter declaration inside the workflow block if you prefer. Whatever you choose, try to group similar things in the same place so you don't end up with declarations all over the place.
-
-### Takeaway
-
-You know how to set up an input variable for a process and supply a value in the command line.
-
-### What's next?
-
-Learn how to add in a second process and chain them together.
-
----
-
-## 7. Add a second step to the workflow
-
-Most real-world workflows involve more than one step. Here we introduce a second process that converts the text to uppercase (all-caps), using the classic UNIX one-liner:
-
-```bash
-tr '[a-z]' '[A-Z]'
-```
-
-We're going to run the command by itself in the terminal first to verify that it works as expected without any of the workflow code getting in the way of clarity, just like we did at the start with `echo 'Hello World'`. 
Then we'll write a process that does the same thing, and finally we'll connect the two processes so the output of the first serves as input to the second. - -### 7.1. Run the command in the terminal by itself - -```bash -echo 'Hello World' | tr '[a-z]' '[A-Z]' -``` - -The output is simply the uppercase version of the text string: - -```console title="Output" -HELLO WORLD -``` - -!!! note - - This is a very naive text replacement one-liner that does not account for accented letters, so for example 'Holà' will become 'HOLà'. This is expected. - -### 7.2. Make the command take a file as input and write the output to a file - -As previously, we want to output results to a dedicated file, which we name by prepending the original filename with `UPPER-`. - -```bash -cat output.txt | tr '[a-z]' '[A-Z]' > UPPER-output.txt -``` - -Now the `HELLO WORLD` output is in the new output file, `UPPER-output.txt`. - -### 7.3. Wrap the command in a new Nextflow process definition - -We can model our new process on the first one, since we want to use all the same components. - -```groovy title="hello-world.nf" linenums="26" -/* - * Use a text replace utility to convert the greeting to uppercase - */ -process convertToUpper { - - publishDir 'results', mode: 'copy' - - input: - path input_file - - output: - path "UPPER-${input_file}" - - script: - """ - cat '$input_file' | tr '[a-z]' '[A-Z]' > UPPER-${input_file} - """ -} -``` - -As a little bonus, here we composed the second output filename based on the first one. - -!!! tip - - Very important to remember: you have to use double quotes around the output filename expression (NOT single quotes) or it will fail. - -### 7.4. Add a call to the new process in the workflow body - -Don't forget we need to tell Nextflow to actually call the process we just created! To do that, we add it to the `workflow` body. 
- -```groovy title="hello-world.nf" linenums="44" -workflow { - - // create a channel for inputs - greeting_ch = Channel.of(params.greeting) - - // emit a greeting - sayHello(greeting_ch) - - // convert the greeting to uppercase - convertToUpper() -} -``` - -Looking good! But we still need to wire up the `convertToUpper` process call to run on the output of `sayHello`. - -### 7.5. Pass the output of the first process to the second process - -The output of the `sayHello` process is automatically packaged as a channel called `sayHello.out`, so all we need to do is pass that as the input to the `convertToUpper` process. - -```groovy title="hello-world.nf" linenums="52" -// convert the greeting to uppercase -convertToUpper(sayHello.out) -``` - -For a simple case like this, that's all we need to do to connect two processes! - -### 7.6. Run the same workflow command as before - -Let's make sure this works: - -```bash -nextflow run hello-world.nf --greeting 'Hello World!' -``` - -Oh, how exciting! There is now an extra line in the log output, which corresponds to the new process we just added: - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [magical_brenner] DSL2 - revision: 0e18f34798 - -executor > local (2) -[57/3836c0] sayHello (1) [100%] 1 of 1 ✔ -[ee/bb3cc8] convertToUpper (1) [100%] 1 of 1 ✔ -``` - -You'll notice that this time the workflow produced two new work subdirectories; one per process call. -Check out the work directory of the call to the second process, where you should find two different output files listed. If you look carefully, you'll notice one of them (the output of the first process) has a little arrow icon on the right; that signifies it's a symbolic link. -It points to the location where that file lives in the work directory of the first process. -By default, Nextflow uses symbolic links to stage input files whenever possible, to avoid making duplicate copies. - -!!! 
note - - All we did was connect the output of `sayHello` to the input of `convertToUpper` and the two processes could be run in serial. - Nextflow did the hard work of handling input and output files and passing them between the two commands for us. - This is the power of channels in Nextflow, doing the busywork of connecting our pipeline steps together. - - What's more, Nextflow will automatically determine which call needs to be executed first based on how they're connected, so the order in which they're written in the workflow body does not matter. - However, we do recommend you be kind to your collaborators and to your future self, and try to write them in a logical order! - -### Takeaway - -You know how to add a second step that takes the output of the first step as input. - -### What's next? - -Learn how to make the workflow run on a batch of input values. - ---- - -## 8. Modify the workflow to run on a batch of input values - -Workflows typically run on batches of inputs that are meant to be processed in bulk, so we want to upgrade the workflow to accept multiple input values. - -Conveniently, the `of()` channel factory we've been using is quite happy to accept more than one value, so we don't need to modify that at all; we just have to load more values into the channel. - -### 8.1. Load multiple greetings into the input channel - -To keep things simple, we go back to hardcoding the greetings in the channel factory instead of using a parameter for the input, but we'll improve on that shortly. - -_Before:_ - -```groovy title="hello-world.nf" linenums="46" -// create a channel for inputs -greeting_ch = Channel.of(params.greeting) -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="46" -// create a channel for inputs -greeting_ch = Channel.of('Hello','Bonjour','Holà') -``` - -The documentation tells us this should work. Can it really be so simple? - -### 8.2. Run the command and look at the log output - -Let's try it. 
-
-```bash
-nextflow run hello-world.nf
-```
-
-Well, it certainly seems to run just fine.
-
-```console title="Output"
- N E X T F L O W ~ version 24.10.0
-
- ┃ Launching `hello-world.nf` [lonely_pare] DSL2 - revision: b9f1d96905
-
-executor > local (6)
-[3d/1fe62c] sayHello (2) [100%] 3 of 3 ✔
-[86/695813] convertToUpper (3) [100%] 3 of 3 ✔
-```
-
-However... This seems to indicate that '3 of 3' calls were made for each process, which is encouraging, but this only gives us one subdirectory path for each. What's going on?
-
-By default, the ANSI logging system writes the logging from multiple calls to the same process on the same line. Fortunately, we can disable that behavior.
-
-### 8.3. Run the command again with the `-ansi-log false` option
-
-To expand the logging to display one line per process call, just add `-ansi-log false` to the command.
-
-```bash
-nextflow run hello-world.nf -ansi-log false
-```
-
-This time we see all six work subdirectories listed in the output:
-
-```console title="Output"
-N E X T F L O W ~ version 24.02.0-edge
-Launching `hello-world.nf` [big_woese] DSL2 - revision: 53f20aeb70
-[62/d81e63] Submitted process > sayHello (1)
-[19/507af3] Submitted process > sayHello (2)
-[8a/3126e6] Submitted process > sayHello (3)
-[12/48a5c6] Submitted process > convertToUpper (1)
-[73/e6e746] Submitted process > convertToUpper (2)
-[c5/4fedda] Submitted process > convertToUpper (3)
-```
-
-That's much better; at least for this number of processes.
-For a complex workflow, or a large number of inputs, having the full list output to the terminal might get a bit overwhelming.
-
-That being said, we have another problem. If you look in the `results` directory, there are only two files: `output.txt` and `UPPER-output.txt`!
-
-```console title="Directory contents"
-results
-├── output.txt
-└── UPPER-output.txt
-```
-
-What's up with that? Shouldn't we be expecting two files per input greeting, so six files in all? 
- -You may recall that we hardcoded the output file name for the first process. -This was fine as long as there was only a single call made per process, but when we start processing multiple input values and publishing the outputs into the same directory of results, it becomes a problem. -For a given process, every call produces an output with the same file name, so Nextflow just overwrites the previous output file every time a new one is produced. - -### 8.4. Ensure the output file names will be unique - -Since we're going to be publishing all the outputs to the same results directory, we need to ensure they will have unique names. -Specifically, we need to modify the first process to generate a file name dynamically so that the final file names will be unique. - -So how do we make the file names unique? A common way to do that is to use some unique piece of metadata as part of the file name. -Here, for convenience, we'll just use the greeting itself. - -_Before:_ - -```groovy title="hello-world.nf" linenums="11" -process sayHello { - - publishDir 'results', mode: 'copy' - - input: - val greeting - - output: - path "output.txt" - - script: - """ - echo '$greeting' > "output.txt" - """ -} -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="11" -process sayHello { - - publishDir 'results', mode: 'copy' - - input: - val greeting - - output: - path "${greeting}-output.txt" - - script: - """ - echo '$greeting' > '$greeting-output.txt' - """ -} -``` - -This should produce a unique output file name for every call of each process. - -### 8.5. Run the workflow and look at the results directory - -Let's run it and check that it works. 
-
-```bash
-nextflow run hello-world.nf
-```
-
-Reverting to the summary view, the output looks like this again:
-
-```console title="Output"
- N E X T F L O W ~ version 24.10.0
-
- ┃ Launching `hello-world.nf` [jovial_mccarthy] DSL2 - revision: 53f20aeb70
-
-executor > local (6)
-[03/f007f2] sayHello (1) [100%] 3 of 3 ✔
-[e5/dd2890] convertToUpper (3) [100%] 3 of 3 ✔
-```
-
-But more importantly, now we have six new files in addition to the two we already had in the `results` directory:
-
-```console title="Directory contents"
-results
-├── Bonjour-output.txt
-├── Hello-output.txt
-├── Holà-output.txt
-├── output.txt
-├── UPPER-Bonjour-output.txt
-├── UPPER-Hello-output.txt
-├── UPPER-Holà-output.txt
-└── UPPER-output.txt
-```
-
-Success! Now we can add as many greetings as we like without worrying about output files being overwritten.
-
-!!! note
-
-    In practice, naming files based on the input data itself is almost always impractical. The better way to generate dynamic filenames is to use a samplesheet containing relevant metadata (such as unique sample IDs) and create a data structure called a 'map', which we pass to processes, and from which we can grab an appropriate identifier to generate the filenames.
-    We'll show you how to do that later in this training course.
-
-### Takeaway
-
-You know how to feed a batch of multiple input elements through a channel.
-
-### What's next?
-
-Learn how to make the workflow take a file as its source of input values.
-
----
-
-## 9. Modify the workflow to take a file as its source of input values
-
-It's often the case that, when we want to run on a batch of multiple input elements, the input values are contained in a file.
-As an example, we have provided you with a CSV file called `greetings.csv` in the `data/` directory, containing several greetings separated by commas.
-
-```csv title="greetings.csv"
-Hello,Bonjour,Holà
-```
-
-So we just need to modify our workflow to read in the values from a file like that.
-
-### 9.1. Set up a CLI parameter with a default value pointing to an input file
-
-First, let's use the `params` system to set up a new parameter called `input_file`, replacing the now-redundant `greeting` parameter, with a default value pointing to the `greetings.csv` file.
-
-_Before:_
-
-```groovy title="hello-world.nf" linenums="6"
-/*
- * Pipeline parameters
- */
-params.greeting = "Holà mundo!"
-```
-
-_After:_
-
-```groovy title="hello-world.nf" linenums="6"
-/*
- * Pipeline parameters
- */
-params.input_file = "data/greetings.csv"
-```
-
-### 9.2. Update the channel declaration to handle the input file
-
-At this point we introduce a new channel factory, `fromPath()`, which has some built-in functionality for handling file paths.
-We're going to use that instead of the `of()` channel factory we used previously; the base syntax looks like this:
-
-```groovy title="channel construction syntax"
-Channel.fromPath(params.input_file)
-```
-
-Now we are going to introduce another new concept: an 'operator', which we will use to transform that CSV file into channel content. You'll learn more about operators later, but for now just think of them as tools for transforming the contents of a channel in various ways.
-
-Since our goal is to read in the contents of a `.csv` file, we're going to add the `.splitCsv()` operator to make Nextflow parse the file contents accordingly, as well as the `.flatten()` operator to turn the array element produced by `.splitCsv()` into a channel of individual elements.
- -So the channel construction instruction becomes: - -```groovy title="channel construction syntax" -Channel.fromPath(params.input_file) - .splitCsv() - .flatten() -``` - -And here it is in the context of the workflow body: - -_Before:_ - -```groovy title="hello-world.nf" linenums="46" -// create a channel for inputs -greeting_ch = Channel.of('Hello','Bonjour','Holà') -``` - -_After:_ - -```groovy title="hello-world.nf" linenums="46" -// create a channel for inputs from a CSV file -greeting_ch = Channel.fromPath(params.input_file) - .splitCsv() - .flatten() -``` - -If you want to see the impact of `.flatten()`, we can make use of `.view()`, another operator, to demonstrate. Edit that section of code so it looks like: - -```groovy title="flatten usage" -// create a channel for inputs from a CSV file -greeting_ch = Channel.fromPath(params.input_file) - .splitCsv() - .view{ "After splitCsv: $it" } - .flatten() - .view{ "After flatten: $it" } -``` - -When you run this updated workflow, you'll see the difference: - -```console title="view output with and without flatten" -After splitCsv: [Hello, Bonjour, Holà] -After flatten: Hello -After flatten: Bonjour -After flatten: Holà -[d3/1a6e23] Submitted process > sayHello (3) -[8f/d9e431] Submitted process > sayHello (1) -[e7/a088af] Submitted process > sayHello (2) -[1a/776e2e] Submitted process > convertToUpper (1) -[83/fb8eba] Submitted process > convertToUpper (2) -[ee/280f93] Submitted process > convertToUpper (3) -``` - -As you can see, the `flatten()` operator has transformed the channel from containing arrays to containing individual elements. This can be useful when you want to process each item separately in your workflow. - -Remove the `.view()` operations before you continue. - -!!! tip - - While you're developing your pipeline, you can inspect the contents of any channel by adding the `.view()` operator to the name of the channel. 
- For example, if you add `greeting_ch.view()` anywhere in the workflow body, when you run the script, Nextflow will print the channel contents to standard out. - - You can also use this to inspect the effect of the operators. - For example, the output of `Channel.fromPath(params.input_file).splitCsv().view()` will look like this: - - ```console title="Output" - [Hello, Bonjour, Holà] - ``` - - While the output of `Channel.fromPath(params.input_file).splitCsv().flatten().view()` will look like this: - - ```console title="Output" - Hello - Bonjour - Holà - ``` - -### 9.3. Run the workflow (one last time!) - -```bash -nextflow run hello-world.nf -``` - -Once again we see each process get executed three times: - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `hello-world.nf` [angry_spence] DSL2 - revision: d171cc0193 - -executor > local (6) -[0e/ceb175] sayHello (2) [100%] 3 of 3 ✔ -[01/046714] convertToUpper (3) [100%] 3 of 3 ✔ -``` - -Looking at the outputs, we see each greeting was correctly extracted and processed through the workflow. We've achieved the same result as the previous step, but now we have a lot more flexibility to add more elements to the channel of greetings we want to process. - -### Takeaway - -You know how to provide the input values to the workflow via a file. - -More generally, you've learned how to use the essential components of Nextflow and you have a basic grasp of the logic of how to build a workflow and manage inputs and outputs. - -### What's next? - -Celebrate your success and take a break! - -Don't worry if the channel types and operators feel like a lot to grapple with the first time you encounter them. -You'll get more opportunities to practice using these components in various settings as you work through this training course. - -When you're ready, move on to Part 2 to learn about another important concept: provisioning the software required for each process. 
diff --git a/docs/hello_nextflow/06_hello_containers.md b/docs/hello_nextflow/05_hello_containers.md similarity index 86% rename from docs/hello_nextflow/06_hello_containers.md rename to docs/hello_nextflow/05_hello_containers.md index 50a710e2..e8de3615 100644 --- a/docs/hello_nextflow/06_hello_containers.md +++ b/docs/hello_nextflow/05_hello_containers.md @@ -1,6 +1,6 @@ -# Part 2: Hello Containers +# Part 5: Hello Containers -In Part 1, you learned how to use the basic building blocks of Nextflow to assemble a simple pipeline capable of processing some text and parallelizing execution if there were multiple inputs. +In Parts X-Y, you learned how to use the basic building blocks of Nextflow to assemble a simple pipeline capable of processing some text and parallelizing execution if there were multiple inputs. However, you were limited to basic UNIX tools available in your environment. Real-world tasks often require various tools and packages not included by default. @@ -14,20 +14,26 @@ That is all very tedious and annoying, so we're going to show you how to use **c --- -## 1. Use a container directly +## 0. Warmup: Pull the container image A **container** is a lightweight, standalone, executable unit of software created from a container **image** that includes everything needed to run an application including code, system libraries and settings. To use a container you usually download or "pull" a container image from a container registry, and then run the container image to create a container instance. -### 1.1. Pull the container image - Let's pull a container image that contains the `cowsay` command so we can use it to display some text in a fun way. ```bash docker pull 'community.wave.seqera.io/library/pip_cowsay:131d6a1b707a8e65' ``` -### 1.2 Use the container to execute a single command +[TODO] SEGUE + +--- + +## 1. 
Use a container directly (one-off) + +[TODO] EXPLAIN USE CASE + +### 1.1 Use the container to execute a single command The `docker run` command is used to spin up a container instance from a container image and execute a command in it. The `--rm` flag tells Docker to remove the container instance after the command has completed. @@ -47,10 +53,26 @@ docker run --rm 'community.wave.seqera.io/library/pip_cowsay:131d6a1b707a8e65' c || || ``` -### 1.2. Spin up the container interactively +[TODO] EXPLAIN WHAT HAPPENED + +### Takeaway + +You know how to pull a container and run it directly in the terminal as a one-off execution. + +### What's next? + +[TODO] UPDATE LEARN TO RUN INTERACTIVELY + +--- + +## 2. Use a container interactively You can also run a container interactively, which will give you a shell prompt inside the container. +### 2.1. Spin up the container + +[TODO] INSTRUCTION + ```bash docker run --rm -it 'community.wave.seqera.io/library/pip_cowsay:131d6a1b707a8e65' /bin/bash ``` @@ -65,7 +87,7 @@ bin dev etc home lib media mnt opt proc root run sbi You can see that the filesystem inside the container is different from the filesystem on your host system. -### 1.3. Run the command +### 2.2. Run the desired tool command Now that you are inside the container, you can run the `cowsay` command directly. @@ -96,7 +118,7 @@ Output: \___)=(___/ ``` -### 1.4. Exit the container +### 2.3. Exit the container To exit the container, you can type `exit` at the prompt or use the ++ctrl+d++ keyboard shortcut. @@ -106,7 +128,7 @@ exit Your prompt should now be back to what it was before you started the container. -### 1.5. Mounting data into containers +### 2.4. Mounting data into containers When you run a container, it is isolated from the host system by default. This means that the container can't access any files on the host system unless you explicitly tell it to. @@ -131,7 +153,7 @@ conda.yml environment.lock greetings.csv pioneers.csv ``` -### 1.6. 
Use the mounted data +### 2.5. Use the mounted data Now that we have mounted the `data` directory into the container, we can use the `cowsay` command to display the contents of the `greetings.csv` file. To do this we'll use the syntax `-t "$(cat data/greetings.csv)"` to output the contents of the file into the `cowsay` command. @@ -174,20 +196,22 @@ You know how to pull a container and run it interactively, make your data access ### What's next? -[TODO] update text (was wrong one) +[TODO] UPDATE USE CONTAINERS IN NF WORKFLOW --- -## 2. Use containers in Nextflow +## 3. Use containers in Nextflow Nextflow has built-in support for running processes inside containers to let you run tools you don't have installed in your compute environment. This means that you can use any container image you like to run your processes, and Nextflow will take care of pulling the image, mounting the data, and running the process inside it. [TODO] [Update this to add a cowsay step to the hello-world pipeline (just add it after the uppercase step) -- include passing in the character as a parameter] -### 2.1. Add a container directive to your process +### 3.1. Write a `cowsay` module [TODO] UPDATE -Edit the `hello-containers.nf` script to add a `container` directive to the `cowsay` process. +[TODO] UPDATE TO GRAFT THIS ONTO THE MODULARIZED HELLO WORKFLOW + +Edit the `hello-containers.nf` script to add the `cowsay` module. _Before:_ @@ -206,7 +230,15 @@ process cowSay { container 'community.wave.seqera.io/library/pip_cowsay:131d6a1b707a8e65' ``` -### 2.2. Run Nextflow pipelines using containers +### 3.2. Import the `cowsay` module into the workflow [TODO] UPDATE + +[TODO] + +### 3.3. Connect the `cowsay` process to the workflow [TODO] UPDATE + +[TODO] + +### 3.4. Run the workflow Run the script to see the container in action. @@ -214,13 +246,11 @@ Run the script to see the container in action. nextflow run hello-containers.nf ``` -!!! NOTE +!!! 
NOTE [TODO] MAYBE CHANGE THIS TO USE `-with-docker` AND INTRODUCE CONFIG IN NEXT SECTION The `nextflow.config` in our current working directory contains `docker.enabled = true`, which tells Nextflow to use Docker to run processes. Without that configuration we would have to specify the `-with-docker` flag when running the script. -### 2.3. Check the results - You should see a new directory called `containers/results` that contains the output of the `cowsay` process. ```console title="containers/results/cowsay-output-Bonjour.txt" @@ -236,7 +266,7 @@ You should see a new directory called `containers/results` that contains the out || || ``` -### 2.4. Explore how Nextflow launched the containerized task +### 3.5. Inspect how Nextflow launched the containerized task Let's take a look at the task directory for one of the cowsay tasks to see how Nextflow works with containers under the hood. @@ -283,4 +313,4 @@ You know how to use containers in Nextflow to run processes. ### What's next? -[TODO] +[TODO] CONFIGURE STUFF diff --git a/docs/hello_nextflow/05_hello_modules.md b/docs/hello_nextflow/05_hello_modules.md deleted file mode 100644 index a9cec9ff..00000000 --- a/docs/hello_nextflow/05_hello_modules.md +++ /dev/null @@ -1,411 +0,0 @@ -# Part 6: Hello Modules - -This section covers how to organize your workflow code to make development and maintenance of your pipeline more efficient and sustainable. -Specifically, we are going to demonstrate how to use **modules**. - -In Nextflow, a **module** is a single process definition that is encapsulated by itself in a standalone code file. -To use a module in a workflow, you just add a single-line import statement to your workflow code file; then you can integrate the process into the workflow the same way you normally would. - -Putting processes into individual modules makes it possible to reuse process definitions in multiple workflows without producing multiple copies of the code. 
-This makes the code more shareable, flexible and maintainable. - -!!!note - - It is also possible to encapsulate a section of a workflow as a 'subworkflow' that can be imported into a larger pipeline, but that is outside the scope of this training. - ---- - -## 0. Warmup [TODO] GET RID OF MOST OF THIS - -When we started developing our workflow, we put everything in one single code file. -In Part 5 (Hello Config), we started turning our one-file workflow into a proper pipeline project. -We moved to the standard Nextflow convention of naming the workflow file `main.nf`, fleshed out the configuration file, and added a parameter file. - -Now it's time to tackle **modularizing** our code, _i.e._ extracting the process definitions into modules. - -We're going to be working with a clean set of project files inside the project directory called `hello-modules` (for Modules). - -### 0.1. Explore the `hello-modules` directory - -Let's move into the project directory. - -```bash -cd hello-modules -``` - -!!! warning - - If you're continuing on directly from Part 5, you'll need to move up one directory first. - ``` - cd ../hello-modules - ``` - -The `hello-modules` directory has the same content and structure that you're expected to end up with in `hello-config` on completion of Part 5. - -```console title="Directory contents" -hello-modules/ -├── demo-params.json -├── main.nf -└── nextflow.config -``` - -For a detailed description of these files, see the warmup section in Part 5. - -### 0.2. Create a symbolic link to the data - -Just like last time, we need to set up a symlink to the data. -To do so, run this command from inside the `hello-modules` directory: - -```bash -ln -s ../data data -``` - -This creates a symbolic link called `data` pointing to the data directory one level up. - -### 0.3 Run the workflow using the appropriate profiles - -Now that everything is in place, we should be able to run the workflow using the profiles we set up in Part 5. 
- -```bash -nextflow run main.nf -profile my_laptop,demo -``` - -And so it does. - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `main.nf` [special_brenner] DSL2 - revision: 5a07b4894b - -executor > local (7) -[26/60774a] SAMTOOLS_INDEX (1) | 3 of 3 ✔ -[5a/eb40c4] GATK_HAPLOTYPECALLER (2) | 3 of 3 ✔ -[8f/94ac86] GATK_JOINTGENOTYPING | 1 of 1 ✔ -``` - -Like previously, there will now be a `work` directory and a `results_genomics` directory inside your project directory. - -### Takeaway - -You're ready to start modularizing your workflow. - -### What's next? - -Learn how to create your first module following conventions inspired by the nf-core project. - ---- - -## 1. Create a module for the `SAMTOOLS_INDEX` process - -From a technical standpoint, you can create a module simply by copying the process definition into its own file, and you can name that file anything you want. -However, the Nextflow community has adopted certain conventions for code organization, influenced in large part by the [nf-core](https://nf-co.re) project (which we'll cover later in this training series). - -The convention for Nextflow modules is that the process definition should be written to a standalone file named `main.nf`, stored in a directory structure with three to four levels: - -```console title="Directory structure" -modules -└── local - └── () - └── - └── main.nf -``` - -By convention, all modules are stored in a directory named `modules`. -Additionally, the convention distinguishes _local_ modules (which are part of your project) from _remote_ modules contained in remote repositories. - -The next levels down are named after the toolkit (if there is one) then the tool itself. -If the process defined in the module invokes more than one tool, as the GATK_JOINTGENOTYPING does in our example workflow, the name of the module can be the name of the method, or something to that effect. 
- -For example, the module we create for the `SAMTOOLS_INDEX` process will live under `modules/local/samtools/index/`. - -```console title="Directory structure" -modules -└── local - └── samtools - └── index - └── main.nf -``` - -!!!note - - We will cover remote modules later in this training, when we introduce the [nf-core library of modules](https://nf-co.re/modules/). - -So let's get started. - -### 2.1. Create a directory to house the local module code for the `SAMTOOLS_INDEX` process - -Run this command to create the appropriate directory structure: - -```bash -mkdir -p modules/local/samtools/index -``` - -The `-p` flag takes care of creating parent directories as needed. - -### 2.2. Create a file stub for the `SAMTOOLS_INDEX` process module - -Now let's create an empty `main.nf` file for the module. - -```bash -touch modules/local/samtools/index/main.nf -``` - -This gives us a place to put the process code. - -### 2.3. Move the `SAMTOOLS_INDEX` process code to the module file - -Copy the whole process definition over from the workflow's `main.nf` file to the module's `main.nf` file, making sure to copy over the `#!/usr/bin/env nextflow` shebang too. - -```groovy title="hello-modules/modules/local/samtools/index/main.nf" linenums="1" -#!/usr/bin/env nextflow - -/* - * Generate BAM index file - */ -process SAMTOOLS_INDEX { - - container 'community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464' - conda "bioconda::samtools=1.20" - - publishDir params.outdir, mode: 'symlink' - - input: - path input_bam - - output: - tuple path(input_bam), path("${input_bam}.bai") - - script: - """ - samtools index '$input_bam' - """ -} -``` - -Once that is done, delete the process definition from the workflow's `main.nf` file, but make sure to leave the shebang in place. - -### 2.4. 
Add an import declaration before the workflow block - -The syntax for importing a local module is fairly straightforward: - -```groovy title="Import declaration syntax" -include { } from './modules/local/>//main.nf' -``` - -Let's insert that above the workflow block and fill it out appropriately. - -_Before:_ - -```groovy title="hello-modules/main.nf" linenums="73" -workflow { -``` - -_After:_ - -```groovy title="hello-modules/main.nf" linenums="73" -// Include modules -include { SAMTOOLS_INDEX } from './modules/local/samtools/index/main.nf' - -workflow { -``` - -### 2.5. Run the workflow to verify that it does the same thing as before - -We're running the workflow with essentially the same code and inputs as before, so let's add the `-resume` flag and see what happens. - -```bash -nextflow run main.nf -profile my_laptop,demo -resume -``` - -Sure enough, Nextflow recognizes that it's still all the same work to be done, even if the code is split up into multiple files. - -```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `main.nf` [agitated_cuvier] DSL2 - revision: 0ce0cd0c04 - -[c3/0d53a4] SAMTOOLS_INDEX (3) | 3 of 3, cached: 3 ✔ -[c6/8c6c30] GATK_HAPLOTYPECALLER (1) | 3 of 3, cached: 3 ✔ -[38/82b2e2] GATK_JOINTGENOTYPING | 1 of 1, cached: 1 ✔ -``` - -So modularizing the code in the course of development does not break resumability! - -### Takeaway - -You know how to extract a process into a local module. - -### What's next? - -Practice making more modules. - ---- - -## 3. Repeat procedure for the remaining processes - -Once you've done one, you can do a million modules... -But let's just do two more for now. - -### 3.1. Create directories to house the code for the two GATK modules - -Since GATK_HAPLOTYPECALLER and GATK_JOINTGENOTYPING both run GATK tools, we'll house them both under a shared `gatk` directory. 
- -```bash -mkdir -p modules/local/gatk/haplotypecaller -mkdir -p modules/local/gatk/jointgenotyping -``` - -You can imagine how it'll be useful to have that optional directory for grouping modules at the toolkit level. - -### 3.2. Create file stubs for the process modules - -Now let's make the file stubs to put the code into. - -```bash -touch modules/local/gatk/haplotypecaller/main.nf -touch modules/local/gatk/jointgenotyping/main.nf -``` - -### 3.3. Move the process code to the module files - -And finally, move the code for each process to the corresponding `main.nf` file, making sure to copy the shebang line too each time. - -### 3.3.1. GATK_HAPLOTYPECALLER module - -```groovy title="hello-modules/modules/local/gatk/haplotypecaller/main.nf" linenums="1" -#!/usr/bin/env nextflow - -/* - * Call variants with GATK HaplotypeCaller - */ -process GATK_HAPLOTYPECALLER { - - container "community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867" - conda "bioconda::gatk4=4.5.0.0" - - publishDir params.outdir, mode: 'symlink' - - input: - tuple path(input_bam), path(input_bam_index) - path ref_fasta - path ref_index - path ref_dict - path interval_list - - output: - path "${input_bam}.g.vcf" , emit: vcf - path "${input_bam}.g.vcf.idx" , emit: idx - - script: - """ - gatk HaplotypeCaller \ - -R ${ref_fasta} \ - -I ${input_bam} \ - -O ${input_bam}.g.vcf \ - -L ${interval_list} \ - -ERC GVCF - """ -} -``` - -### 3.3.2. 
GATK_JOINTGENOTYPING module - -```groovy title="hello-modules/modules/local/gatk/jointgenotyping/main.nf" linenums="1" -#!/usr/bin/env nextflow - -/* - * Combine GVCFs into GenomicsDB datastore and run joint genotyping to produce cohort-level calls - */ -process GATK_JOINTGENOTYPING { - - container "community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867" - conda "bioconda::gatk4=4.5.0.0" - - publishDir params.outdir, mode: 'symlink' - - input: - path all_gvcfs - path all_idxs - path interval_list - val cohort_name - path ref_fasta - path ref_index - path ref_dict - - output: - path "${cohort_name}.joint.vcf" , emit: vcf - path "${cohort_name}.joint.vcf.idx" , emit: idx - - script: - def gvcfs_line = all_gvcfs.collect { gvcf -> "-V ${gvcf}" }.join(' ') - """ - gatk GenomicsDBImport \ - ${gvcfs_line} \ - -L ${interval_list} \ - --genomicsdb-workspace-path ${cohort_name}_gdb - - gatk GenotypeGVCFs \ - -R ${ref_fasta} \ - -V gendb://${cohort_name}_gdb \ - -L ${interval_list} \ - -O ${cohort_name}.joint.vcf - """ -} -``` - -### 3.4. Add import declarations to the workflow `main.nf` file - -Now all that remains is to add the import statements: - -_Before:_ - -```groovy title="hello-modules/main.nf" linenums="3" -// Include modules -include { SAMTOOLS_INDEX } from './modules/local/samtools/index/main.nf' - -workflow { -``` - -_After:_ - -```groovy title="hello-modules/main.nf" linenums="3" -// Include modules -include { SAMTOOLS_INDEX } from './modules/local/samtools/index/main.nf' -include { GATK_HAPLOTYPECALLER } from './modules/local/gatk/haplotypecaller/main.nf' -include { GATK_JOINTGENOTYPING } from './modules/local/gatk/jointgenotyping/main.nf' - -workflow { -``` - -### 3.5. Run the workflow to verify that everything still works as expected - -Look at that short `main.nf` file! Let's run it once last time. - -```bash -nextflow run main.nf -profile my_laptop,demo -resume -``` - -Yep, everything still works, including the resumability of the pipeline. 
- -```console title="Output" -N E X T F L O W ~ version 24.02.0-edge - -┃ Launching `main.nf` [tiny_blackwell] DSL2 - revision: 0ce0cd0c04 - -[62/21cdc5] SAMTOOLS_INDEX (1) | 3 of 3, cached: 3 ✔ -[c6/8c6c30] GATK_HAPLOTYPECALLER (2) | 3 of 3, cached: 3 ✔ -[38/82b2e2] GATK_JOINTGENOTYPING | 1 of 1, cached: 1 ✔ -``` - -Congratulations, you've done all this work and absolutely nothing has changed to how the pipeline works! - -Jokes aside, now your code is more modular, and if you decide to write another pipeline that calls on one of those processes, you just need to type one short import statement to use the relevant module. -This is better than just copy-pasting the code, because if later you decide to improve the module, all your pipelines will inherit the improvements. - -### Takeaway - -You know how to modularize multiple processes in a workflow. - -### What's next? - -Learn to add tests to your pipeline using the nf-test framework. diff --git a/docs/hello_nextflow/07_hello_config.md b/docs/hello_nextflow/06_hello_config.md similarity index 96% rename from docs/hello_nextflow/07_hello_config.md rename to docs/hello_nextflow/06_hello_config.md index e3db8786..232ef7ea 100644 --- a/docs/hello_nextflow/07_hello_config.md +++ b/docs/hello_nextflow/06_hello_config.md @@ -867,51 +867,42 @@ As it turns out, there's a lot of overlap between this kind of configuration and [TODO] NO JUST STICK IT STRAIGHT INTO A PARAMS JSON -### 5.4. Using a parameter file +### 5.1. Using a parameter file [TODO] UPDATE CODE -We provide a parameter file in the current directory, called `demo-params.json`, which contains key-value pairs for all of the parameters our workflow expects. -The values are the same input files and reference files we've been using so far. +Another convenient way to provide parameter values without having to modify the source code or putting a lot in the command line (which is error prone) is to use a parameter file. 
+ +We provide an example parameter file in the current directory, called `demo-params.json`, which contains a key-value pair for the input our workflow expects. ```json title="demo-params.json" linenums="1" { - "reads_bam": "data/sample_bams.txt", - "outdir": "results_genomics", - "reference": "data/ref/ref.fasta", - "reference_index": "data/ref/ref.fasta.fai", - "reference_dict": "data/ref/ref.dict", - "intervals": "data/ref/intervals.bed", - "cohort_name": "family_trio" + "greeting": "Dobrý den" } ``` To run the workflow with this parameter file, simply add `-params-file demo-params.json` to the base command. ```bash -nextflow run main.nf -profile my_laptop -params-file demo-params.json +nextflow run hello-world.nf -params-file demo-params.json ``` It works! And as expected, this produces the same outputs as previously. ```console title="Output" - N E X T F L O W ~ version 24.10.0 - - ┃ Launching `main.nf` [marvelous_mandelbrot] DSL2 - revision: 328869237b - -executor > local (7) -[63/23a827] SAMTOOLS_INDEX (1) [100%] 3 of 3 ✔ -[aa/60aa4a] GATK_HAPLOTYPECALLER (2) [100%] 3 of 3 ✔ -[35/bda5eb] GATK_JOINTGENOTYPING [100%] 1 of 1 ✔ +[TODO] ``` -This is great because, with the parameter file in hand, we'll now be able to provide parameter values at runtime without having to type massive command lines **and** without modifying the workflow nor the default configuration. +[TODO] CONCLUDE + +This may seem like overkill when you only have a single parameter to specify, but some pipelines expect dozens of parameters. +In those cases, using a parameter file will allow us to provide parameter values at runtime without having to type massive command lines and without modifying the workflow. That being said, it was nice to be able to demo the workflow without having to keep track of filenames and such. Let's see if we can use a profile to replicate that behavior. -### 5.6. Create a demo profile +### 5.2. 
Create a demo profile

-[TODO] DECIDE IF WE KEEP THIS; IF SO UPDATE TEXT
+[TODO] DECIDE IF WE KEEP THIS; IF SO UPDATE TEXT ;; BUT MAYBE PUT THIS FIRST

Yes we can! We just need to retrieve the default parameter declarations as they were written in the original workflow (with the `params.*` syntax) and copy them into a new profile that we'll call `demo`.
diff --git a/temp.md b/temp.md
new file mode 100644
index 00000000..31121c7f
--- /dev/null
+++ b/temp.md
@@ -0,0 +1,32 @@
+From a technical standpoint, you can create a module simply by copying the process definition into its own file, and you can name that file anything you want.
+However, the Nextflow community has adopted certain conventions for code organization, influenced in large part by the [nf-core](https://nf-co.re) project (which we'll cover later in this training series).
+
+The convention for Nextflow modules is that the process definition should be written to a standalone file named `main.nf`, stored in a directory structure with three to four levels:
+
+```console title="Directory structure"
+modules
+└── local
+    └── (<toolkit>)
+        └── <tool>
+            └── main.nf
+```
+
+By convention, all modules are stored in a directory named `modules`.
+Additionally, the convention distinguishes _local_ modules (which are part of your project) from _remote_ modules contained in remote repositories.
+
+The next levels down are named after the toolkit (if there is one) then the tool itself.
+If the process defined in the module invokes more than one tool, as the GATK_JOINTGENOTYPING process does in our example workflow, the name of the module can be the name of the method, or something to that effect.
+
+For example, the module we create for the `SAMTOOLS_INDEX` process will live under `modules/local/samtools/index/`.
+```console title="Directory structure"
+modules
+└── local
+    └── samtools
+        └── index
+            └── main.nf
+```
+
+!!!note
+
+    We will cover remote modules later in this training, when we introduce the [nf-core library of modules](https://nf-co.re/modules/).
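Putting the convention into practice takes just a couple of shell commands. As a sketch (assuming you run them from the project root, using the `SAMTOOLS_INDEX` module as the example), you would create the module directory and an empty `main.nf` stub like so:

```bash
# create the conventional directory structure for a local module;
# -p creates any missing parent directories along the way
mkdir -p modules/local/samtools/index

# create an empty main.nf stub to hold the process definition
touch modules/local/samtools/index/main.nf
```

From there, the process definition moves into the stub, and a single-line `include` statement in the workflow file brings it back in.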