Running the pipeline

Requirements

Recommended hardware

CPU: >10 cores per sample
Memory: 6GB per core

Note

Running the pipeline with different or fewer resources may work, but this has not been tested.

Software

python, version 3.9 or newer
pip3
virtualenv
singularity

If these are available, the remaining dependencies are easily installed with pip.
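Before installing, it can help to confirm that the tools listed above are on the PATH. The following is a minimal sketch, not part of the pipeline: check_tools is a hypothetical helper name, and it only tests that each command can be found, not which version is installed.

```shell
# Hypothetical helper: report which of the given commands are on the PATH.
# Returns 0 if all were found and 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tool names taken from the Software list above
check_tools python3 pip3 virtualenv singularity \
    || echo "Install the missing tools before continuing." >&2
```

The Python version itself can be checked separately with python3 --version.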

Installation

A list of releases of the Marple - Inherited Cancer pipeline can be found on GitHub or in the Changelog.

Clone the Marple git repo

Clone the repository at the selected version to fetch the pipeline:

# Set up a working directory path
WORKING_DIRECTORY="/path_to_working_directory"

# Set version
VERSION="v0.6.0"

# Clone selected version
git clone --branch ${VERSION} https://github.com/clinical-genomics-uppsala/marple_rd_tc.git ${WORKING_DIRECTORY}

Create python environment

Running the Marple pipeline requires a Python virtual environment. Create a virtual environment and then install the pipeline requirements specified in requirements.txt.

# Create a new virtual environment
python3.9 -m venv ${WORKING_DIRECTORY}/virtual/environment

# Enter working directory
cd ${WORKING_DIRECTORY}

# Activate python environment
source virtual/environment/bin/activate

# Install requirements
pip install -r requirements.txt

This installs all software required to run the pipeline into a virtual environment, which you must activate each time before running the pipeline.

Input files

Four files need to be adapted to your compute environment and sequence data: samples.tsv, units.tsv, config.yaml and resources.yaml.

Samples and units

A samples.tsv file and a units.tsv file are needed; together they contain sample information, sequencing metadata and the location of the fastq files. The required columns are specified in the Marple schemas. The .tsv files can also be generated with the hydra-genetics tools, using hydra-genetics create-input-files:

hydra-genetics create-input-files -d /path/to/fastq/ --tc 1.0 -a 'adaptersequence1,adaptersequence2' --sample-type N --sample-regex "^([0-9]{4})_S"

# If the barcode is missing from the fastq header, a default barcode can be set with -b NNNNNNNN
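To see what the --sample-regex pattern above would capture, the expression can be tried on a file name before generating the input files. This is a sketch under two assumptions that are not confirmed here: that the pattern is matched against the file name rather than the full path, and that the first capture group becomes the sample name. sample_from_fastq is a hypothetical helper and the file names are made up.

```shell
# Hypothetical helper: print what the regex ^([0-9]{4})_S captures from
# a fastq file name, or nothing if the name does not match.
sample_from_fastq() {
    # Assumption: the pattern is applied to the file name, not the full path
    basename "$1" | sed -nE 's/^([0-9]{4})_S.*/\1/p'
}

# Made-up file names for illustration
sample_from_fastq "/path/to/fastq/1234_S1_L001_R1_001.fastq.gz"     # prints 1234
sample_from_fastq "/path/to/fastq/NA12878_S2_L001_R1_001.fastq.gz"  # prints nothing
```

A file name that prints nothing here does not start with four digits followed by _S, so the pattern would need adjusting for that naming scheme.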

Config

The bare bones of the config file can be found in config/config.yaml. It needs to be adapted to match the local paths to reference files, bed files, caches etc. on your system.

Current config.yaml:

---
resources: "config/resources.yaml"
samples: "samples.tsv"
units: "units.tsv"
output: "{{PATH_TO_REPO}}/marple_rd_tc/config/output_files.yaml"

default_container: "docker://hydragenetics/common:3.1.1"

modules:
  alignment: "v0.7.0"
  annotation: "v1.2.0"
  cnv_sv: "v2.0.0"
  compression: "v2.1.0"
  filtering: "v1.1.0"
  prealignment: "v1.4.0"
  qc: "v0.6.0"
  snv_indels: "v2.0.0"

reference:
  annovar: "/opt/ohpc/pub/annotators/annovar/2025Feb18"
  fasta: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta"
  fai: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.fai"
  sites: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.known_indels.vcf.gz" #Needed?
  design_bed: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230608_hg38_TE-98982205-wPGRS.merged.bed"
  design_intervals: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230608_hg38_TE-98982205-wPGRS.merged.intervals"
  exon_bed: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230602_hg38_coding_exons.bed"
  exon_intervals: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230602_hg38_coding_exons.intervals"
  pgrs_bed: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230707_hg38_pgrs_snps.bed"
  skip_chrs:
    - "chrX"
    - "chrY"
    - "chrM"

trimmer_software: fastp_pe

bcftools_view_deepsomatic:
  extra: "--exclude 'QUAL<1 | FORMAT/DP<10 | FORMAT/GQ<2 | FORMAT/VAF>0.4 && FORMAT/VAF<0.6 | FORMAT/VAF<0.01 | FORMAT/VAF>0.99' " # remove gnomad of some AF threshold? -T ^/projects/wp3/nobackup/Workspace/mosaic/hardregNgnomad_hg38.bed

bwa_mem:
  container: "docker://hydragenetics/bwa_mem:0.7.17"
  amb: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.amb"
  ann: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.ann"
  bwt: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.bwt"
  pac: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.pac"
  sa: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.sa"

deepmosaic_draw:
  container: "library://sanglee8888/deepmosaic/deepmosaic:sha256.28b3dd3b3ab1daa63b53d627c3fba3e6b4da38447ccdb9c7abc073718ff9453e"

deepmosaic_predict:
  container: "library://sanglee8888/deepmosaic/deepmosaic:sha256.28b3dd3b3ab1daa63b53d627c3fba3e6b4da38447ccdb9c7abc073718ff9453e"

deepvariant:
  container: "docker://hydragenetics/deepvariant:1.4.0"
  model_type: "WES"
  output_gvcf: True

deepsomatic_t_only:
  container: "docker://google/deepsomatic:1.8.0"
  extra: "--use_default_pon_filtering=true make_examples_extra_args='vsc_min_fraction_snps=0.029,vsc_min_fraction_indels=0.05' " #--use_default_pon_filtering=true or --pon_filtering="pon.vcf" --process_somatic=true or ""
  model: "WGS_TUMOR_ONLY" # WGS,WES,PACBIO,ONT,FFPE_WGS,FFPE_WES,WGS_TUMOR_ONLY,PACBIO_TUMOR_ONLY,ONT_TUMOR_ONLY

exomedepth_call:
  container: "docker://hydragenetics/exomedepth:1.1.15"
  bedfile: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230706_hg38_TE-98982205-wPGRS.merged_200bpwindows.bed"
  exonsfile: "/projects/wp3/nobackup/TwistCancer/Bedfiles/exons_GRCh38_ensembl109.txt"
  genesfile: "/projects/wp3/nobackup/TwistCancer/Bedfiles/genes_GRCh38_ensembl109.txt"
  genome_version: "hg38"

exomedepth_export:
  container: "docker://hydragenetics/exomedepth:1.1.15"

export_qc_bedtools_intersect:
  extra: " -wb "

export_qc_xlsx_report:
  wanted_transcripts: "/projects/wp3/nobackup/TwistCancer/Bedfiles/wanted_transcripts.txt"

fastp_pe:
  container: "docker://hydragenetics/fastp:0.20.1"

fastqc:
  container: "docker://hydragenetics/fastqc:0.11.9"

melt:
  container: "/projects/wp3/Software/MELTv2.2.2/MELT_v2.2.2.sif"
  bed: "/projects/wp3/Software/MELTv2.2.2/add_bed_files/Hg38/Hg38.genes.bed"
  extra: ""
  mei: "/projects/wp3/Software/MELTv2.2.2/mei_list.txt"

mosaicforecast_input:
  bcftools_filter: "--exclude 'FILTER!=\"PASS\" || FORMAT/DP<8'"

mosaicforecast_phasing:
  container: "docker://yanmei/mosaicforecast_hg38:0.0.1"
  f_format: "bam" # cram/bam
  umap: "/usr/local/bin/k24.umap.wg.bw"

mosaicforecast_readlevel:
  container: "docker://yanmei/mosaicforecast_hg38:0.0.1"
  f_format: "bam" # cram/bam
  umap: "/usr/local/bin/k24.umap.wg.bw"

mosaicforecast_genotype_prediction:
  container: "docker://yanmei/mosaicforecast_hg38:0.0.1"
  model_trained_SNP: "/data/ref_data/wp3/mosaicforecast/150xRFmodel_addRMSK_Refine.rds" # 200xRFmodel_addRMSK_Refine.rds or deletions_250x.RF.rds(phase) or insertions_250x.RF.rds(phase)
  model_trained_INS: "/data/ref_data/wp3/mosaicforecast/insertions_250x.RF.rds" # 200xRFmodel_addRMSK_Refine.rds or deletions_250x.RF.rds(phase) or insertions_250x.RF.rds(phase)
  model_trained_DEL: "/data/ref_data/wp3/mosaicforecast/deletions_250x.RF.rds" # 200xRFmodel_addRMSK_Refine.rds or deletions_250x.RF.rds(phase) or insertions_250x.RF.rds(phase)
  model_type_SNP: "Refine"  # phase/refine
  model_type_INS: "Phase"  # phase/refine
  model_type_DEL: "Phase"  # phase/refine

mosdepth_bed:
  container: "docker://hydragenetics/mosdepth:0.3.2"
  thresholds: "10,20,50"

multiqc:
  container: "docker://hydragenetics/multiqc:1.11"
  reports:
    DNA:
      included_unit_types: ["T", "N"]
      config: "{{PATH_TO_REPO}}/marple_rd_tc/config/multiqc_config.yaml"
      qc_files:
        - "prealignment/fastp_pe/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastp.json"
        - "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_{read}_fastqc.zip"
        - "qc/picard_collect_alignment_summary_metrics/{sample}_{type}.alignment_summary_metrics.txt"
        - "qc/picard_collect_duplication_metrics/{sample}_{type}.duplication_metrics.txt"
        - "qc/picard_collect_gc_bias_metrics/{sample}_{type}.gc_bias.summary_metrics"
        - "qc/picard_collect_hs_metrics/{sample}_{type}.HsMetrics.txt"
        - "qc/picard_collect_insert_size_metrics/{sample}_{type}.insert_size_metrics.txt"
        - "qc/picard_collect_multiple_metrics/{sample}_{type}.{ext}"
        - "qc/samtools_stats/{sample}_{type}.samtools-stats.txt"
        - "qc/mosdepth_bed/{sample}_{type}.mosdepth.summary.txt"
        - "qc/mosdepth_bed/{sample}_{type}.per-base.bed.gz"
        - "qc/mosdepth_bed/{sample}_{type}.mosdepth.region.dist.txt"
        - "qc/mosdepth_bed/{sample}_{type}.mosdepth.global.dist.txt"

picard_collect_alignment_summary_metrics:
  container: "docker://hydragenetics/picard:2.25.4"

picard_collect_duplication_metrics:
  container: "docker://hydragenetics/picard:2.25.4"

picard_collect_gc_bias_metrics:
  container: "docker://hydragenetics/picard:2.25.4"

picard_collect_hs_metrics:
  container: "docker://hydragenetics/picard:2.25.4"

picard_collect_insert_size_metrics:
  container: "docker://hydragenetics/picard:2.25.4"

picard_collect_multiple_metrics:
  container: "docker://hydragenetics/picard:2.25.4"
  output_ext:
    - "gc_bias.pdf" #collect_gc_bias
    - "gc_bias.summary_metrics"
    - "gc_bias.detail_metrics"
    - "alignment_summary_metrics" # collect_alignment_summary
    - "insert_size_metrics" # collect_insert_size
    - "insert_size_histogram.pdf"
    - "error_summary_metrics" # collect_sequencing_artefact
    - "bait_bias_detail_metrics"
    - "bait_bias_summary_metrics"
    - "pre_adapter_detail_metrics"
    - "pre_adapter_summary_metrics"
    - "quality_distribution_metrics" # quality_score_distribution
    - "quality_distribution.pdf"
    - "quality_yield_metrics" # quality yield metrics

picard_mark_duplicates:
  container: "docker://hydragenetics/picard:2.25.4"

vep:
  container: "docker://ensemblorg/ensembl-vep:release_110.1"
  vep_cache: "/data/ref_genomes/VEP"
  extra: "--assembly GRCh38 --check_existing --pick --variant_class --everything"

vt_decompose:
  container: "docker://hydragenetics/vt:2015.11.10"

vt_normalize:
  container: "docker://hydragenetics/vt:2015.11.10"
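After adapting the paths in the config, a quick scan for files that do not exist can catch typos early. The sketch below is not part of the pipeline: check_config_paths is a hypothetical helper that assumes paths are written as double-quoted values starting with /, as in the listing above. Entries that refer to locations inside a container image rather than on the host would also be reported, so the output needs reviewing by hand.

```shell
# Hypothetical helper: list double-quoted absolute paths in a YAML file
# that do not exist on this machine.
check_config_paths() {
    config_file="$1"
    # Extract every "/..." value, strip the quotes and test each path once
    grep -oE '"/[^"]*"' "$config_file" | tr -d '"' | sort -u |
    while IFS= read -r path; do
        [ -e "$path" ] || echo "missing: $path"
    done
}

# Only run the check if the config file is present
if [ -f config/config.yaml ]; then
    check_config_paths config/config.yaml
fi
```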

Resources

A resources.yaml file can also be found in the config/ folder. It is adapted to the Uppsala Clinical Genomics compute cluster, but can be used as an indication of the resources needed by the different programs.

Current resources.yaml:

default_resources:
  threads: 1
  time: "12:00:00"
  mem_mb: 6144
  mem_per_cpu: 6144
  partition: "core"

bwa_mem:
  mem_mb: 122880
  mem_per_cpu: 6144
  threads: 8

deepvariant:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10
  time: "12:00:00"

deepsomatic_t_only:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10
  time: "12:00:00"

fastp_pe:
  threads: 1
  mem_mb: 6144
  mem_per_cpu: 6144

fastqc:
  threads: 2
  mem_mb: 12288
  mem_per_cpu: 6144

mosdepth_bed:
  mem_mb: 36864
  threads: 4

samtools_sort:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10

samtools_view:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10

vep:
  threads: 5
  mem_mb: 30720
  mem_per_cpu: 6144
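When adapting these values to another cluster, note that several entries above are consistent with mem_mb = threads × mem_per_cpu, where 6144 MB matches the recommended 6 GB per core. This is an observation about the listed values, not a documented rule: bwa_mem and mosdepth_bed do not follow it. The sketch below uses a hypothetical helper, mem_mb_for, to reproduce the arithmetic as a starting point.

```shell
# Hypothetical helper: total memory in MB for a rule, assuming
# mem_mb = threads * mem_per_cpu.
mem_mb_for() {
    threads="$1"
    mem_per_cpu="$2"
    echo $(( threads * mem_per_cpu ))
}

mem_mb_for 10 6144   # deepvariant: prints 61440
mem_mb_for 2 6144    # fastqc: prints 12288
mem_mb_for 5 6144    # vep: prints 30720
```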

Run command

# Activate the virtual environment
source virtual/environment/bin/activate

# Run snakemake, passing the extra config parameters sequenceid and PATH_TO_REPO in a single --config option
snakemake --profile snakemakeprofile --configfile config.yaml -s /path/to/marple/workflow/Snakefile --config sequenceid="230202-test" PATH_TO_REPO=/path/to/repo/
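Before a full run, a dry run can show which jobs snakemake would execute without running them. The sketch below adds snakemake's -n (dry run) flag to the command above, with the same placeholder paths. The require_venv helper is a hypothetical addition; it relies on the VIRTUAL_ENV variable that the activate script sets.

```shell
# Hypothetical helper: fail with a message if no virtual environment is active.
require_venv() {
    if [ -z "${VIRTUAL_ENV:-}" ]; then
        echo "No virtual environment is active; run: source virtual/environment/bin/activate" >&2
        return 1
    fi
}

# Dry run (-n): list the jobs that would be executed, without running them
if require_venv && command -v snakemake >/dev/null 2>&1; then
    snakemake -n --profile snakemakeprofile --configfile config.yaml \
        -s /path/to/marple/workflow/Snakefile \
        --config sequenceid="230202-test" PATH_TO_REPO=/path/to/repo/ \
        || echo "Dry run reported errors; check the config and input files." >&2
fi
```

If the dry run lists the expected jobs, remove -n to start the actual run.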