Add a process¶
This section provides the guidelines for adding a new process in the main.nf file so that the config files and the recipes to build the Singularity and Docker containers can be generated automatically. Note that the Geniac CLI provides a linter to check that the pipeline complies with the following guidelines.
Note
All the examples below are taken from the geniac demo pipeline. You can clone this repository and reproduce what is presented. This geniac demo is fully functional.
Structure of a process¶
Important
Consider that one process invokes only one tool.
Each process must have a label directive. The label name may differ from the process name. For example:
process fastqc {
    label 'fastqc'
    label 'lowMem'
    label 'lowCpu'
    tag "${prefix}"
    publishDir "${params.outDir}/fastqc", mode: 'copy'
    input:
    set val(prefix), file(reads) from rawReadsFastqcCh
    output:
    file "*_fastqc.{zip,html}" into fastqcResultsCh
    file "v_fastqc.txt" into fastqcVersionCh
    script:
    """
    fastqc -q $reads
    fastqc --version > v_fastqc.txt
    """
}
A label is essential: it makes it possible to automatically generate the configuration files conda.config, multiconda.config, singularity.config, docker.config, path.config and multipath.config. These configuration files use the withLabel process selector. The section Guidelines explains that the name of the label must follow specific rules.
Important
Take great care to declare the label for each process: the automatic generation of the configuration files mentioned above, along with the singularity / docker recipes and containers, relies on the label names, which are obtained by parsing the conf/geniac.config file from the source code.
Note
Why use withLabel rather than withName as the process selector in the configuration files? Using withLabel makes it possible to use the exact same tool within two or more different processes with different options. This is a big advantage, especially when you use containers: you don't have to build one container per process, since the same container can be shared between processes.
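As a sketch of what this sharing looks like on the configuration side, the fragment below uses a single withLabel selector to attach one image to every process that carries the fastqc label. The parameter params.imageDir and the image file name are hypothetical placeholders for illustration; the files actually generated by geniac may differ.

```groovy
process {
    // applies to every process declared with: label 'fastqc'
    withLabel: fastqc {
        // hypothetical parameter and file name, not the literal geniac output
        container = "${params.imageDir}/fastqc.sif"
    }
}
```

With withName, an equivalent entry would have to be repeated for each process that calls the tool.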
Answer these questions first¶
Where is the tool available?¶
Is it just a standard Unix command?¶
Yes, it is something like grep, sed, cat, etc., then see Standard UNIX command.
Is it available in Conda?¶
Yes, the tool is available in conda and can be easily installed from the bioconda or conda-forge channels, then see Easy install with Conda.
Yes, but it cannot be easily installed because the order of the channels matters or it requires dependencies and/or pip directives in the conda recipe, then see Custom install with conda.
Is it available only as a binary or as an executable script?¶
Yes, it is available as a binary (but without source code available) or as an executable script (shell, python, perl), then see Binary or executable script.
Is the source code available?¶
Yes, then see Install from source code.
Is it available as R packages using renv?¶
Yes, then see R packages using renv.
Have you still not answered yes?¶
Probably not, otherwise, you would not be reading this. This means that the tool can fall in any of these categories:
it is provided as deb, rpm packages or any executable installer,
it is a Windows executable that needs mono to be run,
it is anything else that needs a custom installation procedure.
Then see Custom install.
Does my tool require some environment variables to be set?¶
If Yes, see Environment variables.
How many CPUs and memory resources does the tool require?¶
See Resource tuning to define the computing resources necessary to run your process.
Guidelines¶
Standard UNIX command¶
This is an easy one.
prerequisite¶
The command must work on a standard UNIX system.
label¶
Always use label 'onlyLinux'.
example¶
process standardUnixCommand {
    label 'onlyLinux'
    label 'minMem'
    label 'minCpu'
    publishDir "${params.outDir}/standardUnixCommand", mode: 'copy'
    input:
    file hello from helloWorldOutputCh
    output:
    file "bonjourMonde.txt"
    script:
    """
    sed -e 's/Hello World/Bonjour Monde/g' ${hello} > bonjourMonde.txt
    """
}
container¶
You have nothing to do: the install process will build the recipes and images for you.
Easy install with Conda¶
prerequisite¶
Of course, the tool has to be available in a conda channel.
Edit the file conf/geniac.config and add for example rmarkdown = "conda-forge::r-markdown=0.8=r351h96ca727_1003" in the section params.geniac.tools as follows:
params {
    geniac {
        tools {
            rmarkdown = "conda-forge::r-markdown=0.8=r351h96ca727_1003"
        }
    }
}
The syntax follows the conda package naming pattern softName = "condaChannelName::softName=version=buildString".
Note that for some tools, other conda dependencies are required and can be added as follows:
params {
    geniac {
        tools {
            fastqc = "conda-forge::openjdk=8.0.192=h14c3975_1003 bioconda::fastqc=0.11.6=2"
        }
    }
}
Note also that you can add conda dependencies from other tools that have been set in the section params.geniac.tools. This keeps tool versions consistent across tools whenever this is required. To do so, just add the variable in the list, such as ${params.geniac.tools.python}, as shown below:
params {
    geniac {
        tools {
            fastqc = "${params.geniac.tools.python} conda-forge::openjdk=8.0.192=h14c3975_1003 bioconda::fastqc=0.11.6=2"
        }
    }
}
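For such a reference to resolve, the referenced tool needs its own entry in the same section. A minimal sketch, assuming that the python entry is declared before the entry that uses it; the python pin and its conda-forge channel are assumptions chosen for illustration only:

```groovy
params {
    geniac {
        tools {
            // assumption: illustrative pin, declared first so it can be reused below
            python = "conda-forge::python=3.7.8=h6f2ec95_1_cpython"
            // the python specification is expanded at the start of the fastqc list
            fastqc = "${params.geniac.tools.python} conda-forge::openjdk=8.0.192=h14c3975_1003 bioconda::fastqc=0.11.6=2"
        }
    }
}
```

Both tools then share the exact same python version, and changing the pin in one place updates every tool that references it.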
label¶
The label directive must have the exact same name as given in the params.geniac.tools section. The label must not contain the prefix renv, which is reserved for a tool with R packages using renv.
example¶
Add your process in the main.nf file. It can take any name (not necessarily the name of the software that will be called on the command line) provided it follows the Naming convention.
process fastqc {
    label 'fastqc'
    label 'lowMem'
    label 'lowCpu'
    tag "${prefix}"
    publishDir "${params.outDir}/fastqc", mode: 'copy'
    input:
    set val(prefix), file(reads) from rawReadsFastqcCh
    output:
    file "*_fastqc.{zip,html}" into fastqcResultsCh
    file "v_fastqc.txt" into fastqcVersionCh
    script:
    """
    fastqc -q $reads
    fastqc --version > v_fastqc.txt
    """
}
container¶
In most cases, you will have nothing to do. However, some tools depend on packages that have to be installed from the Linux distribution used for the containers. For example, fastqc requires some fonts to be installed. In that case, list the packages that have to be installed with dnf (the Dandified YUM command, which is the package management utility for the Linux distributions used for the containers). To do so, edit the file conf/geniac.config and add for example fastqc = 'fontconfig dejavu*' in the section params.geniac.containers.yum as follows:
params {
    geniac {
        containers {
            yum {
                fastqc = 'fontconfig dejavu*'
            }
        }
    }
}
Warning
Be careful to use the exact same name in params.geniac.containers.yum, in params.geniac.tools and in the label; otherwise, the container will not work.
If you need to add custom commands and environment variables to the docker / singularity recipes automatically generated by geniac (see Add custom commands and environment variables), use the following scopes associated with the label of the tool:
params.geniac.containers.cmd.post: to define commands which will be executed at the end of the default commands generated by geniac,
params.geniac.containers.cmd.envCustom: to define environment variables which will be set inside the docker and singularity images.
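The two scopes could be filled in as sketched below for the fastqc label. The value format (a list of strings per label) and the commands and variables themselves are assumptions made for illustration; check them against the section Add custom commands and environment variables before use.

```groovy
params {
    geniac {
        containers {
            cmd {
                post {
                    // assumed format: list of shell commands, keyed by the tool label
                    fastqc = ['echo "post-install step for fastqc"']
                }
                envCustom {
                    // assumed format: list of NAME=value strings, keyed by the tool label
                    fastqc = ['MY_VAR=1']
                }
            }
        }
    }
}
```

As with params.geniac.containers.yum, the key used in each scope has to match the label of the tool.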
Custom install with conda¶
prerequisite¶
Of course, the tool has to be available in a conda channel.
Write the custom conda recipe in the directory recipes/conda. For example, add the file trickySoftware.yml:
name: trickySoftware_env
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - python=3.7.8=h6f2ec95_1_cpython
  - pip
  - pip:
    - numpy==1.19.2
Warning
The yml file with the conda recipe must follow these guidelines:
Name the file using the name of the label (e.g. if the label is trickySoftware, the file must be named trickySoftware.yml).
Choose a unique name for your conda environment.
Each conda package has the naming pattern softName = "condaChannelName::softName=version=buildString".
If you need pip to install some packages, add pip in your dependencies and use the pattern softName==version for each package to be installed with pip.
Edit the file conf/geniac.config and add for example trickySoftware = "${projectDir}/recipes/conda/trickySoftware.yml" in the section params.geniac.tools as follows:
params {
    geniac {
        tools {
            trickySoftware = "${projectDir}/recipes/conda/trickySoftware.yml"
        }
    }
}
label¶
The label directive must have the exact same name as given in the params.geniac.tools section. The label must not contain the prefix renv, which is reserved for a tool with R packages using renv.
example¶
Add your process in the main.nf file. It can take any name (not necessarily the name of the software that will be called on the command line) provided it follows the Naming convention.
process trickySoftware {
    label 'trickySoftware'
    label 'minMem'
    label 'minCpu'
    publishDir "${params.outDir}/trickySoftware", mode: 'copy'
    output:
    file "trickySoftwareResults.txt"
    script:
    """
    python --version > trickySoftwareResults.txt 2>&1
    """
}
container¶
In most cases, you will have nothing to do. However, some tools depend on packages that have to be installed from the Linux distribution used for the containers. For example, fastqc requires some fonts to be installed. In that case, list the packages that have to be installed with dnf (the Dandified YUM command, which is the package management utility for the Linux distributions used for the containers). To do so, edit the file conf/geniac.config and add for example myFavouriteTool = 'gsl blas' in the section params.geniac.containers.yum as follows:
params {
    geniac {
        containers {
            yum {
                myFavouriteTool = 'gsl blas'
            }
        }
    }
}
If you need to add custom commands and environment variables to the docker / singularity recipes automatically generated by geniac (see Add custom commands and environment variables), use the following scopes associated with the label of the tool:
params.geniac.containers.cmd.post: to define commands which will be executed at the end of the default commands generated by geniac,
params.geniac.containers.cmd.envCustom: to define environment variables which will be set inside the docker and singularity images.
Warning
Be careful to use the exact same name in params.geniac.containers.yum, in params.geniac.tools and in the label; otherwise, the container will not work.
Binary or executable script¶
prerequisite¶
The binary or executable script must be in the bin/ directory of the pipeline.
It must have read and execute UNIX permissions.
It must work on a UNIX system.
label¶
Use label 'onlyLinux' if this is a bash script, or define a new tool with the programming language required to run the script or binary (e.g. label 'python').
example¶
Add your process in the main.nf file. It can take any name (not necessarily the name of the software that will be called on the command line) provided it follows the Naming convention.
process execBinScript {
    label 'onlyLinux'
    label 'minMem'
    label 'minCpu'
    publishDir "${params.outDir}/execBinScript", mode: 'copy'
    output:
    file "execBinScriptResults_*"
    script:
    """
    apMyscript.sh > execBinScriptResults_1.txt
    someScript.sh > execBinScriptResults_2.txt
    """
}
Note
apMyscript.sh is named with the ap prefix because it has been developed for the pipeline, while someScript.sh does not have this prefix because it is a third-party script (see Naming convention).
container¶
You have nothing to do: the install process will build the recipes and images for you.
Install from source code¶
prerequisite¶
First, retrieve the source code and add it in a subdirectory of the modules/fromSource directory. Create the modules/fromSource directory if needed. For example, add the source code of the helloWorld tool in the modules/fromSource/helloWorld directory. This directory can be added as a git submodule (see this tutorial).
Then comes the tricky part. In the file modules/fromSource/CMakeLists.txt, add a call to the ExternalProject_Add function from CMake:
ExternalProject_Add(
    helloWorld
    SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/helloWorld
    CMAKE_ARGS
        -DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/externalProject/bin)
Important
Always use the variable ${CMAKE_CURRENT_SOURCE_DIR} in the SOURCE_DIR directive, for example SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/helloWorld.
Always install the binary in ${CMAKE_BINARY_DIR}/externalProject/bin (note that CMAKE_BINARY_DIR is actually the build directory you have created to configure and build the pipeline, see Installation).
Important
Always create another CMakeLists.txt file in the folder which stores the source code of the tool. For example, create the modules/fromSource/helloWorld/CMakeLists.txt file, which describes how the source code must be installed. Depending on the source code you added, refer to the CMake documentation to write the CMakeLists.txt file correctly.
Note
Installation from source code offers great flexibility, as the software developer can control everything during the installation process. However, this requires more configuration. In particular, the software developer has to be fluent in CMake in order to tackle specific use cases; see Install tools from source: more examples for more details.
label¶
The label must be the name of the directory in which you added the source code, for example helloWorld. The label must not contain the prefix renv, which is reserved for a tool with R packages using renv.
example¶
Add your process in the main.nf file. It can take any name (not necessarily the name of the software that will be called on the command line) provided it follows the Naming convention.
process helloWorld {
    label 'helloWorld'
    label 'minMem'
    label 'minCpu'
    publishDir "${params.outDir}/helloWorld", mode: 'copy'
    output:
    file "helloWorld.txt" into helloWorldOutputCh
    script:
    """
    helloWorld > helloWorld.txt
    """
}
container¶
You have nothing to do: the install process will build the recipes and images for you.
If you need to add custom commands and environment variables to the docker / singularity recipes automatically generated by geniac (see Add custom commands and environment variables), use the following scopes associated with the label of the tool:
params.geniac.containers.cmd.post: to define commands which will be executed at the end of the default commands generated by geniac,
params.geniac.containers.cmd.envCustom: to define environment variables which will be set inside the docker and singularity images.
R packages using renv¶
The renv package helps you to create reproducible environments for your R projects. The renv.lock lockfile records the state of your project's private library, and can be used to restore the state of that library as required. geniac can use a renv.lock lockfile to install all the package dependencies needed by your R environment.
prerequisite¶
You will need to:
create the conda recipe in the folder recipes/conda, which defines which R version you want to use,
add the label with the three scopes yml, env and bioc in the section params.geniac.tools of the file conf/geniac.config,
copy the renv.lock file into a subfolder with the name of the label inside the folder recipes/dependencies/.
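The second step could look like the sketch below. The label renvExample, the file names, the parameter params.condaCacheDir and the values are hypothetical; reading env as the location of the conda environment and bioc as a Bioconductor release is an assumption to check against the section R with reproducible environments using renv package.

```groovy
params {
    geniac {
        tools {
            // hypothetical label; it starts with the prefix renv
            renvExample {
                // conda recipe that pins the R version
                yml = "${projectDir}/recipes/conda/renvExample.yml"
                // assumed meaning: where the conda environment is created
                env = "${params.condaCacheDir}/custom_renvExample"
                // assumed meaning: Bioconductor release used to restore packages
                bioc = "3.11"
            }
        }
    }
}
```

With this label, the lockfile from the third step would sit in recipes/dependencies/renvExample/renv.lock.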
label¶
The label directive must have the exact same name as given in the params.geniac.tools section. The label must contain the prefix renv.
example¶
Adding a tool with R packages using renv requires two processes to be defined. Therefore, the complete guidelines are described in the section R with reproducible environments using renv package.
container¶
You have nothing to do: the install process will build the recipes and images for you.
If you need to add custom commands and environment variables to the docker / singularity recipes automatically generated by geniac (see Add custom commands and environment variables), use the following scopes associated with the label of the tool:
params.geniac.containers.cmd.post: to define commands which will be executed at the end of the default commands generated by geniac,
params.geniac.containers.cmd.envCustom: to define environment variables which will be set inside the docker and singularity images.
Custom install¶
prerequisite¶
Create a folder in recipes/dependencies/ with the label of your tool, for example recipes/dependencies/alpine. Add your installer file (deb, rpm or any other format) to this folder, along with any other files that may be needed, especially to build the container.
label¶
Choose any name you want.
example¶
Add your process in the main.nf file. It can take any name (not necessarily the name of the software that will be called on the command line) provided it follows the Naming convention.
process alpine {
    label 'alpine'
    label 'minMem'
    label 'minCpu'
    publishDir "${params.outDir}/alpine", mode: 'copy'
    input:
    val x from oneToFiveCh
    output:
    file "alpine_*"
    script:
    """
    source ${projectDir}/env/alpine.env
    echo "Hello from alpine: \$(date). This is very high here: \${PEAK_HEIGHT}!" > alpine_${x}.txt
    """
}
container¶
This is the only case in which you have to write the recipe yourself. The recipe must have the same name as the label, with the suffix .def for singularity and .Dockerfile for docker. Save your recipes in the folders recipes/singularity and recipes/docker respectively. For example, the alpine.def recipe looks like this:
Bootstrap: docker
From: alpine:3.7
%setup
    mkdir -p ${SINGULARITY_ROOTFS}/opt
%files
    alpine/myDependency.sh /opt/myDependency.sh
%post
    apk update
    apk add bash
    bash /opt/myDependency.sh
%environment
    export LC_ALL=C
    export PATH=/usr/games:$PATH
The alpine.Dockerfile recipe looks like this:
FROM alpine:3.7
RUN mkdir -p /opt
ADD alpine/myDependency.sh /opt/myDependency.sh
RUN apk update
RUN apk add bash
RUN bash /opt/myDependency.sh
ENV LC_ALL C
ENV PATH /usr/games:$PATH
Important
As your recipe will very likely depend on files you added, for example in the recipes/dependencies/alpine directory, reference these files in the %files section for singularity or with the ADD directive for docker using a path that includes the name of the label, for example alpine/myDependency.sh.
Environment variables¶
Process specific¶
prerequisite
Add a file with the name of your process and the extension .env in the folder env/. For example, add env/alpine.env:
#!/bin/bash
# required environment variables for alpine
PEAK_HEIGHT="4810m"
export PEAK_HEIGHT
example
In your process, source the env/alpine.env file and then use the variable you defined:
process alpine {
    label 'alpine'
    label 'minMem'
    label 'minCpu'
    publishDir "${params.outDir}/alpine", mode: 'copy'
    input:
    val x from oneToFiveCh
    output:
    file "alpine_*"
    script:
    """
    source ${projectDir}/env/alpine.env
    echo "Hello from alpine: \$(date). This is very high here: \${PEAK_HEIGHT}!" > alpine_${x}.txt
    """
}
Resource tuning¶
Everything related to process resources is defined in conf/process.config.
Shared between processes¶
You can define generic labels for both CPU and memory (as you wish) in the file conf/process.config. For example:
withLabel: minCpu {
    cpus = 1
}
withLabel: lowCpu {
    cpus = 2
}
withLabel: medCpu {
    cpus = 4
}
withLabel: highCpu {
    cpus = 8
}
withLabel: extraCpu {
    cpus = 16
}
withLabel: minMem {
    memory = 1.GB
}
withLabel: lowMem {
    memory = 2.GB
}
withLabel: medMem {
    memory = 8.GB
}
withLabel: highMem {
    memory = 16.GB
}
withLabel: extraMem {
    memory = 32.GB
}
Warning
Note that you must use a multi-line format as shown above, otherwise the Geniac CLI linter will throw an error.
Then, in any process, you can just set any label you need. For example:
process execBinScript {
    label 'onlyLinux'
    label 'minMem'
    label 'minCpu'
    publishDir "${params.outDir}/execBinScript", mode: 'copy'
    output:
    file "execBinScriptResults_*"
    script:
    """
    apMyscript.sh > execBinScriptResults_1.txt
    someScript.sh > execBinScriptResults_2.txt
    """
}
Process specific¶
To optimize the resources used on a computing cluster, you may want to finely tune the CPU and memory requested by a process. To do so, define the process selector withName in the file conf/process.config for your process of interest. For example:
withName:outputDocumentation {
    memory = { checkMax( 2500.MB, 'memory' ) }
}
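The sketch below shows how a shared label and a process-specific selector can coexist, assuming that both are nested in the process scope of conf/process.config and that the checkMax function is provided by the pipeline. In Nextflow, a withName selector takes priority over a withLabel selector, so the process-specific value wins for outputDocumentation even if that process also carries the minMem label.

```groovy
process {
    // shared default for every process labelled 'minMem'
    withLabel: minMem {
        memory = 1.GB
    }
    // process-specific value; overrides the label setting for this process only
    withName: outputDocumentation {
        // checkMax is assumed to be defined by the pipeline
        memory = { checkMax( 2500.MB, 'memory' ) }
    }
}
```

Other processes labelled minMem keep the 1.GB setting.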
Tip
To assess the amount of resources used by your process, refer to the Metrics section of the Nextflow documentation.
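One way to collect these metrics is to enable the Nextflow execution report and trace file in the configuration, as sketched below. The output locations are hypothetical choices and the list of trace fields is only a small selection.

```groovy
// HTML report with per-process CPU and memory usage
report {
    enabled = true
    file = "${params.outDir}/report.html"   // hypothetical location
}
// tab-separated file with one row per executed task
trace {
    enabled = true
    file = "${params.outDir}/trace.txt"     // hypothetical location
    fields = 'task_id,name,status,cpus,memory,%cpu,peak_rss,realtime'
}
```

The same outputs can be requested for a single run with the nextflow run options -with-report and -with-trace. Comparing peak_rss with the requested memory indicates whether a label such as lowMem or medMem is a better fit for the process.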
Results¶
Use the publishDir directive with the ${params.outDir} parameter and organize your results as you wish. For example:
publishDir "${params.outDir}/execBinScript", mode: 'copy'