Running the pipeline
Requirements
Recommended hardware
- CPU: >10 cores per sample
- Memory: 6 GB per core
Note
Running the pipeline with different or fewer resources may work, but has not been tested.
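As a rough sizing aid, the figures above can be turned into a memory estimate per sample. The sketch below assumes that 6 GB means 6144 MB (6 x 1024), which matches the mem_per_cpu value used in resources.yaml. The helper name mem_mb_for_cores is hypothetical.

```shell
#!/bin/sh
# Memory in MB for a given number of cores, at 6144 MB (6 GB) per core.
mem_mb_for_cores() {
    echo $(( $1 * 6144 ))
}

mem_mb_for_cores 10   # prints 61440
```

For 10 cores this gives 61440 MB, the same value that resources.yaml lists as mem_mb for the rules that run with 10 threads.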
Software
- python, version 3.9 or newer
- pip3
- virtualenv
- singularity
If these are available, installation is easily done with pip.
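Before installing, it can help to confirm that the listed tools are on PATH. The following is a minimal sketch, not part of the pipeline. The helper names has_tool and version_ok are hypothetical, and the tool names are taken from the list above.

```shell
#!/bin/sh
# Return 0 if the named command can be found on PATH.
has_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Return 0 if a "major.minor" version string is 3.9 or newer.
version_ok() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 9 ]; }
}

# Report each tool from the Software list as found or missing.
for tool in python3 pip3 virtualenv singularity; do
    if has_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done

# Check the python version only when python3 is available.
if has_tool python3; then
    pyver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
    if version_ok "$pyver"; then
        echo "python3 $pyver is 3.9 or newer"
    else
        echo "python3 $pyver is older than 3.9"
    fi
fi
```

The script only reports. It does not exit with an error when a tool is missing, so it is safe to run on a login node before requesting resources.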
Installation
A list of releases of the Marple - Inherited Cancer pipeline can be found on GitHub or in the Changelog.
Clone the Marple git repo
Clone the repository to fetch the selected version of the pipeline:
# Set up a working directory path
WORKING_DIRECTORY="/path_to_working_directory"
# Set version
VERSION="v0.6.0"
# Clone selected version
git clone --branch ${VERSION} https://github.com/clinical-genomics-uppsala/marple_rd_tc.git ${WORKING_DIRECTORY}
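After cloning, the checkout can be compared with the requested version. This sketch assumes that release tags follow the vMAJOR.MINOR.PATCH form seen in v0.6.0. The helper names is_release_tag and checked_out_tag are hypothetical. It relies on git describe --tags --exact-match, which prints a tag name only when HEAD sits exactly on a tag.

```shell
#!/bin/sh
# Return 0 if the argument looks like vMAJOR.MINOR.PATCH (assumed tag format).
is_release_tag() {
    printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Print the tag that HEAD of the given clone points at, if any.
checked_out_tag() {
    git -C "$1" describe --tags --exact-match 2>/dev/null
}

# Only run the comparison when both variables from the clone step are set.
if [ -n "${WORKING_DIRECTORY:-}" ] && [ -n "${VERSION:-}" ]; then
    if ! is_release_tag "$VERSION"; then
        echo "unexpected version string: $VERSION"
    elif [ "$(checked_out_tag "$WORKING_DIRECTORY")" = "$VERSION" ]; then
        echo "clone is at $VERSION"
    else
        echo "clone is not at $VERSION"
    fi
fi
```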
Create python environment
To run the Marple pipeline, a Python virtual environment is needed. Create a virtual environment and then install the pipeline requirements specified in requirements.txt.
# Create a new virtual environment
python3.9 -m venv ${WORKING_DIRECTORY}/virtual/environment
# Enter working directory
cd ${WORKING_DIRECTORY}
# Activate python environment
source virtual/environment/bin/activate
# Install requirements
pip install -r requirements.txt
This installs all software required to run the pipeline into a virtual environment, which you have to activate each time before running the pipeline.
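Because the environment must be active for every run, a small guard can catch a forgotten activation. The sketch relies on the VIRTUAL_ENV variable, which the activate script exports. The helper name venv_active is hypothetical.

```shell
#!/bin/sh
# Return 0 when a virtual environment is active in this shell.
# The activate script exports VIRTUAL_ENV; deactivate unsets it.
venv_active() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if venv_active; then
    echo "virtual environment active: $VIRTUAL_ENV"
else
    echo "no virtual environment active; run: source virtual/environment/bin/activate"
fi
```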
Input files
Four files need to be adapted to your compute environment and sequence data: samples.tsv, units.tsv, config.yaml and resources.yaml.
Samples and units
A samples.tsv file and a units.tsv file are needed. Together they contain sample information, sequencing metadata and the location of the fastq files. The required columns are specified in the Marple schemas. The .tsv files can also be generated with hydra-genetics tools, using hydra-genetics create-input-files:
hydra-genetics create-input-files -d /path/to/fastq/ --tc 1.0 -a 'adaptersequence1,adaptersequence2' --sample-type N --sample-regex "^([0-9]{4})_S"
# If the fastq header lacks a barcode, a default barcode can be set with -b NNNNNNNN
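To see what the --sample-regex pattern in the command above captures, it can be tried on file names before generating the input files. This sketch assumes the pattern is applied to the fastq file name and that the first capture group becomes the sample name. The file names and the helper sample_from_fastq are hypothetical.

```shell
#!/bin/sh
# Extract the sample name from a fastq file name with the pattern
# ^([0-9]{4})_S, the same one passed to --sample-regex above.
# Prints the captured group, or nothing when the name does not match.
sample_from_fastq() {
    printf '%s\n' "$1" | sed -nE 's/^([0-9]{4})_S.*/\1/p'
}

# Hypothetical file names, for illustration only
sample_from_fastq "1234_S1_L001_R1_001.fastq.gz"      # prints 1234
sample_from_fastq "NA12878_S1_L001_R1_001.fastq.gz"   # prints nothing
```

A name that prints nothing here would need a different pattern, since this one only accepts four digits followed by _S at the start of the name.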
Config
A bare-bones config file can be found in config/config.yaml. It needs to be adapted to match the local paths to reference files, bed files, caches etc. on your system.
Current config.yaml:
---
resources: "config/resources.yaml"
samples: "samples.tsv"
units: "units.tsv"
output: "{{PATH_TO_REPO}}/marple_rd_tc/config/output_files.yaml"
default_container: "docker://hydragenetics/common:3.1.1"
modules:
  alignment: "v0.7.0"
  annotation: "v1.2.0"
  cnv_sv: "v2.0.0"
  compression: "v2.1.0"
  filtering: "v1.1.0"
  prealignment: "v1.4.0"
  qc: "v0.6.0"
  snv_indels: "v2.0.0"
reference:
  annovar: "/opt/ohpc/pub/annotators/annovar/2025Feb18"
  fasta: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta"
  fai: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.fai"
  sites: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.known_indels.vcf.gz" #Needed?
  design_bed: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230608_hg38_TE-98982205-wPGRS.merged.bed"
  design_intervals: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230608_hg38_TE-98982205-wPGRS.merged.intervals"
  exon_bed: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230602_hg38_coding_exons.bed"
  exon_intervals: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230602_hg38_coding_exons.intervals"
  pgrs_bed: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230707_hg38_pgrs_snps.bed"
  skip_chrs:
    - "chrX"
    - "chrY"
    - "chrM"
trimmer_software: fastp_pe
bcftools_view_deepsomatic:
  extra: "--exclude 'QUAL<1 | FORMAT/DP<10 | FORMAT/GQ<2 | FORMAT/VAF>0.4 && FORMAT/VAF<0.6 | FORMAT/VAF<0.01 | FORMAT/VAF>0.99' " # remove gnomad of some AF threshold? -T ^/projects/wp3/nobackup/Workspace/mosaic/hardregNgnomad_hg38.bed
bwa_mem:
  container: "docker://hydragenetics/bwa_mem:0.7.17"
  amb: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.amb"
  ann: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.ann"
  bwt: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.bwt"
  pac: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.pac"
  sa: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.sa"
deepmosaic_draw:
  container: "library://sanglee8888/deepmosaic/deepmosaic:sha256.28b3dd3b3ab1daa63b53d627c3fba3e6b4da38447ccdb9c7abc073718ff9453e"
deepmosaic_predict:
  container: "library://sanglee8888/deepmosaic/deepmosaic:sha256.28b3dd3b3ab1daa63b53d627c3fba3e6b4da38447ccdb9c7abc073718ff9453e"
deepvariant:
  container: "docker://hydragenetics/deepvariant:1.4.0"
  model_type: "WES"
  output_gvcf: True
deepsomatic_t_only:
  container: "docker://google/deepsomatic:1.8.0"
  extra: "--use_default_pon_filtering=true make_examples_extra_args='vsc_min_fraction_snps=0.029,vsc_min_fraction_indels=0.05' " #--use_default_pon_filtering=true or --pon_filtering="pon.vcf" --process_somatic=true or ""
  model: "WGS_TUMOR_ONLY" # WGS,WES,PACBIO,ONT,FFPE_WGS,FFPE_WES,WGS_TUMOR_ONLY,PACBIO_TUMOR_ONLY,ONT_TUMOR_ONLY
exomedepth_call:
  container: "docker://hydragenetics/exomedepth:1.1.15"
  bedfile: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230706_hg38_TE-98982205-wPGRS.merged_200bpwindows.bed"
  exonsfile: "/projects/wp3/nobackup/TwistCancer/Bedfiles/exons_GRCh38_ensembl109.txt"
  genesfile: "/projects/wp3/nobackup/TwistCancer/Bedfiles/genes_GRCh38_ensembl109.txt"
  genome_version: "hg38"
exomedepth_export:
  container: "docker://hydragenetics/exomedepth:1.1.15"
export_qc_bedtools_intersect:
  extra: " -wb "
export_qc_xlsx_report:
  wanted_transcripts: "/projects/wp3/nobackup/TwistCancer/Bedfiles/wanted_transcripts.txt"
fastp_pe:
  container: "docker://hydragenetics/fastp:0.20.1"
fastqc:
  container: "docker://hydragenetics/fastqc:0.11.9"
melt:
  container: "/projects/wp3/Software/MELTv2.2.2/MELT_v2.2.2.sif"
  bed: "/projects/wp3/Software/MELTv2.2.2/add_bed_files/Hg38/Hg38.genes.bed"
  extra: ""
  mei: "/projects/wp3/Software/MELTv2.2.2/mei_list.txt"
mosaicforecast_input:
  bcftools_filter: "--exclude 'FILTER!=\"PASS\" || FORMAT/DP<8'"
mosaicforecast_phasing:
  container: "docker://yanmei/mosaicforecast_hg38:0.0.1"
  f_format: "bam" # cram/bam
  umap: "/usr/local/bin/k24.umap.wg.bw"
mosaicforecast_readlevel:
  container: "docker://yanmei/mosaicforecast_hg38:0.0.1"
  f_format: "bam" # cram/bam
  umap: "/usr/local/bin/k24.umap.wg.bw"
mosaicforecast_genotype_prediction:
  container: "docker://yanmei/mosaicforecast_hg38:0.0.1"
  model_trained_SNP: "/data/ref_data/wp3/mosaicforecast/150xRFmodel_addRMSK_Refine.rds" # 200xRFmodel_addRMSK_Refine.rds or deletions_250x.RF.rds(phase) or insertions_250x.RF.rds(phase)
  model_trained_INS: "/data/ref_data/wp3/mosaicforecast/insertions_250x.RF.rds" # 200xRFmodel_addRMSK_Refine.rds or deletions_250x.RF.rds(phase) or insertions_250x.RF.rds(phase)
  model_trained_DEL: "/data/ref_data/wp3/mosaicforecast/deletions_250x.RF.rds" # 200xRFmodel_addRMSK_Refine.rds or deletions_250x.RF.rds(phase) or insertions_250x.RF.rds(phase)
  model_type_SNP: "Refine" # phase/refine
  model_type_INS: "Phase" # phase/refine
  model_type_DEL: "Phase" # phase/refine
mosdepth_bed:
  container: "docker://hydragenetics/mosdepth:0.3.2"
  thresholds: "10,20,50"
multiqc:
  container: "docker://hydragenetics/multiqc:1.11"
  reports:
    DNA:
      included_unit_types: ["T", "N"]
      config: "{{PATH_TO_REPO}}/marple_rd_tc/config/multiqc_config.yaml"
      qc_files:
        - "prealignment/fastp_pe/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastp.json"
        - "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_{read}_fastqc.zip"
        - "qc/picard_collect_alignment_summary_metrics/{sample}_{type}.alignment_summary_metrics.txt"
        - "qc/picard_collect_duplication_metrics/{sample}_{type}.duplication_metrics.txt"
        - "qc/picard_collect_gc_bias_metrics/{sample}_{type}.gc_bias.summary_metrics"
        - "qc/picard_collect_hs_metrics/{sample}_{type}.HsMetrics.txt"
        - "qc/picard_collect_insert_size_metrics/{sample}_{type}.insert_size_metrics.txt"
        - "qc/picard_collect_multiple_metrics/{sample}_{type}.{ext}"
        - "qc/samtools_stats/{sample}_{type}.samtools-stats.txt"
        - "qc/mosdepth_bed/{sample}_{type}.mosdepth.summary.txt"
        - "qc/mosdepth_bed/{sample}_{type}.per-base.bed.gz"
        - "qc/mosdepth_bed/{sample}_{type}.mosdepth.region.dist.txt"
        - "qc/mosdepth_bed/{sample}_{type}.mosdepth.global.dist.txt"
picard_collect_alignment_summary_metrics:
  container: "docker://hydragenetics/picard:2.25.4"
picard_collect_duplication_metrics:
  container: "docker://hydragenetics/picard:2.25.4"
picard_collect_gc_bias_metrics:
  container: "docker://hydragenetics/picard:2.25.4"
picard_collect_hs_metrics:
  container: "docker://hydragenetics/picard:2.25.4"
picard_collect_insert_size_metrics:
  container: "docker://hydragenetics/picard:2.25.4"
picard_collect_multiple_metrics:
  container: "docker://hydragenetics/picard:2.25.4"
  output_ext:
    - "gc_bias.pdf" #collect_gc_bias
    - "gc_bias.summary_metrics"
    - "gc_bias.detail_metrics"
    - "alignment_summary_metrics" # collect_alignment_summary
    - "insert_size_metrics" # collect_insert_size
    - "insert_size_histogram.pdf"
    - "error_summary_metrics" # collect_sequencing_artefact
    - "bait_bias_detail_metrics"
    - "bait_bias_summary_metrics"
    - "pre_adapter_detail_metrics"
    - "pre_adapter_summary_metrics"
    - "quality_distribution_metrics" # quality_score_distribution
    - "quality_distribution.pdf"
    - "quality_yield_metrics" # quality yield metrics
picard_mark_duplicates:
  container: "docker://hydragenetics/picard:2.25.4"
vep:
  container: "docker://ensemblorg/ensembl-vep:release_110.1"
  vep_cache: "/data/ref_genomes/VEP"
  extra: "--assembly GRCh38 --check_existing --pick --variant_class --everything"
vt_decompose:
  container: "docker://hydragenetics/vt:2015.11.10"
vt_normalize:
  container: "docker://hydragenetics/vt:2015.11.10"
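Since most entries in the config are absolute paths that are specific to one system, a quick scan can show which ones still need adapting. The sketch below lists every double-quoted value that starts with / and reports those that do not exist locally. Container URIs such as docker:// and the {{PATH_TO_REPO}} placeholders are not matched. Paths that only exist inside a container image will also be reported, which is expected. The helper name missing_paths is hypothetical.

```shell
#!/bin/sh
# Print each quoted absolute path in the given config file that does
# not exist on this system, one per line.
missing_paths() {
    grep -oE '"/[^"]+"' "$1" | tr -d '"' | sort -u | while read -r cfg_path; do
        [ -e "$cfg_path" ] || echo "$cfg_path"
    done
}

# Scan the config only when it is present in the current directory.
if [ -f config/config.yaml ]; then
    missing_paths config/config.yaml
fi
```

An empty result means every absolute path in the file was found. It does not check that the files have the right content.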
Resources
A resources.yaml file can also be found in the config/ folder. It is adapted to the Uppsala Clinical Genomics compute cluster, but can be used as an indication of the resources needed by the different programs.
Current resources.yaml:
default_resources:
  threads: 1
  time: "12:00:00"
  mem_mb: 6144
  mem_per_cpu: 6144
  partition: "core"
bwa_mem:
  mem_mb: 122880
  mem_per_cpu: 6144
  threads: 8
deepvariant:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10
  time: "12:00:00"
deepsomatic_t_only:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10
  time: "12:00:00"
fastp_pe:
  threads: 1
  mem_mb: 6144
  mem_per_cpu: 6144
fastqc:
  threads: 2
  mem_mb: 12288
  mem_per_cpu: 6144
mosdepth_bed:
  mem_mb: 36864
  threads: 4
samtools_sort:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10
samtools_view:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10
vep:
  threads: 5
  mem_mb: 30720
  mem_per_cpu: 6144
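When adapting these values to another cluster, it can be useful to see mem_mb next to threads x mem_per_cpu for each rule. Which of the two settings the scheduler actually uses depends on the Snakemake profile, which is not shown here. The sketch assumes the file on disk is indented YAML with rule names at the top level, as in config/resources.yaml in the repository. The helper name compare_mem is hypothetical.

```shell
#!/bin/sh
# For each rule that sets both threads and mem_per_cpu, print:
#   rule name, mem_mb as listed, threads * mem_per_cpu
compare_mem() {
    awk '
        /^[^ #].*:/ { rule = $1; sub(/:$/, "", rule); next }  # top-level key
        $1 == "threads:"     { t[rule] = $2 }
        $1 == "mem_mb:"      { m[rule] = $2 }
        $1 == "mem_per_cpu:" { c[rule] = $2 }
        END {
            for (r in t) if (r in c) print r, m[r], t[r] * c[r]
        }
    ' "$1" | sort
}

# Compare only when the resources file is present.
if [ -f config/resources.yaml ]; then
    compare_mem config/resources.yaml
fi
```

Rules where the last two columns differ are the ones to look at first when the scheduler rejects a job or a job runs out of memory.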
Run command
# Activate the virtual environment
source virtual/environment/bin/activate
# Run snakemake, passing sequenceid and PATH_TO_REPO as extra config parameters
snakemake --profile snakemakeprofile --configfile config.yaml -s /path/to/marple/workflow/Snakefile --config sequenceid="230202-test" PATH_TO_REPO=/path/to/repo/
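A dry run with snakemake -n lists the jobs that would be executed without running them, which is a cheap way to check the input files and config first. The sketch below only prints the command lines so they can be reviewed before use. The helper name marple_cmd is hypothetical, and all paths are the placeholders from the command above. It does no quoting, so it assumes paths without spaces.

```shell
#!/bin/sh
# Print the snakemake command line for one run.
# Arguments: sequence id, PATH_TO_REPO value, Snakefile path,
# and optionally "dry" to add -n (dry run).
marple_cmd() {
    seqid=$1
    repo=$2
    snakefile=$3
    dry=${4:-}
    cmd="snakemake --profile snakemakeprofile --configfile config.yaml -s $snakefile"
    if [ "$dry" = "dry" ]; then
        cmd="$cmd -n"
    fi
    # Both key=value pairs go after a single --config option.
    echo "$cmd --config sequenceid=$seqid PATH_TO_REPO=$repo"
}

# Print the dry-run command first, then the real one
marple_cmd "230202-test" /path/to/repo/ /path/to/marple/workflow/Snakefile dry
marple_cmd "230202-test" /path/to/repo/ /path/to/marple/workflow/Snakefile
```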