Running the pipeline

❗ Requirements

  • CPU: >10 cores per sample
  • Memory: 6 GB per core

Note

Running the pipeline with fewer or different resources may work, but this has not been tested.
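As a rough illustration of the requirement above, the memory needed grows linearly with the number of cores. The sketch below only multiplies the two figures; the function name is made up for this example, and 1 GB is assumed to be counted as 1024 MB.

```python
# Illustrative only: estimate the memory implied by "6 GB per core".
MB_PER_GB = 1024  # assumption: 6 GB is counted as 6144 MB


def required_memory_mb(cores: int, gb_per_core: int = 6) -> int:
    """Return the memory (in MB) implied by the per-core requirement."""
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return cores * gb_per_core * MB_PER_GB


if __name__ == "__main__":
    # 10 cores at 6 GB each
    print(required_memory_mb(10))  # 61440
```

With 10 cores this gives 61440 MB, i.e. 60 GB for a single sample.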

Software

If the required software is available, the pipeline dependencies are easily installed with pip.

💻 Installation

A list of releases of the Marple - Inherited Cancer pipeline can be found on GitHub or in the Changelog.

Clone the Marple git repo

Clone the repository at the selected version to fetch the pipeline:

# Set up a working directory path
WORKING_DIRECTORY="/path_to_working_directory"

# Set version
VERSION="v0.6.0"

# Clone selected version
git clone --branch "${VERSION}" https://github.com/clinical-genomics-uppsala/marple_rd_tc.git "${WORKING_DIRECTORY}"

Create python environment

A Python virtual environment is needed to run the Marple pipeline. Create a virtual environment and then install the pipeline requirements specified in requirements.txt.

# Create a new virtual environment
python3.9 -m venv "${WORKING_DIRECTORY}/virtual/environment"

# Enter working directory
cd "${WORKING_DIRECTORY}"

# Activate python environment
source virtual/environment/bin/activate

# Install requirements
pip install -r requirements.txt

This installs all software required to run the pipeline into a virtual environment, which you have to activate each time before running the pipeline.
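Since the environment has to be activated for every run, a quick check can catch a forgotten activation. The sketch below uses the standard comparison of sys.prefix and sys.base_prefix; the function name is hypothetical and the check is not part of the pipeline itself.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; run "
              "'source virtual/environment/bin/activate' first.")
```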

📚 Input files

Four files need to be adapted to your compute environment and sequence data: samples.tsv, units.tsv, config.yaml and resources.yaml.

Samples and units

A samples.tsv file and a units.tsv file are needed; together they contain sample information, sequencing metadata and the location of the FASTQ files. The required columns are specified in the Marple schemas. The .tsv files can also be generated with hydra-genetics tools, using hydra-genetics create-input-files:

hydra-genetics create-input-files -d /path/to/fastq/ --tc 1.0 -a 'adaptersequence1,adaptersequence2' --sample-type N --sample-regex "^([0-9]{4})_S"

# If the FASTQ headers lack a barcode, a default barcode can be set with -b NNNNNNNN
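Before generating the input files it can help to test the sample regex on a few file names. The sketch below assumes the pattern is applied to the FASTQ file name and that the first capture group becomes the sample name; how hydra-genetics applies it internally is not confirmed here, and the file names are hypothetical.

```python
import re
from typing import Optional

# Same pattern as passed to --sample-regex above.
SAMPLE_REGEX = re.compile(r"^([0-9]{4})_S")


def sample_name(fastq_filename: str) -> Optional[str]:
    """Return the captured sample name, or None if the pattern does not match."""
    match = SAMPLE_REGEX.search(fastq_filename)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for name in ["1234_S1_L001_R1_001.fastq.gz", "sampleA_S2_L001_R1_001.fastq.gz"]:
        print(name, "->", sample_name(name))
```

Under these assumptions a file name must start with exactly four digits followed by _S; anything else is not recognised and the regex would need adjusting.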

Config

A bare-bones config file can be found in config/config.yaml. It needs to be adapted to match the local paths to reference files, BED files, caches etc. on your system.

Current config.yaml:
---
resources: "config/resources.yaml"
samples: "samples.tsv"
units: "units.tsv"
output: "{{PATH_TO_REPO}}/marple_rd_tc/config/output_files.yaml"

default_container: "docker://hydragenetics/common:3.1.1"

modules:
  alignment: "v0.7.0"
  annotation: "v1.2.0"
  cnv_sv: "v2.0.0"
  compression: "v2.1.0"
  filtering: "v1.1.0"
  prealignment: "v1.4.0"
  qc: "v0.6.0"
  snv_indels: "v2.0.0"

reference:
  annovar: "/opt/ohpc/pub/annotators/annovar/2025Feb18"
  fasta: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta"
  fai: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.fai"
  sites: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.known_indels.vcf.gz" #Needed?
  design_bed: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230608_hg38_TE-98982205-wPGRS.merged.bed"
  design_intervals: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230608_hg38_TE-98982205-wPGRS.merged.intervals"
  exon_bed: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230602_hg38_coding_exons.bed"
  exon_intervals: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230602_hg38_coding_exons.intervals"
  pgrs_bed: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230707_hg38_pgrs_snps.bed"
  skip_chrs:
    - "chrX"
    - "chrY"
    - "chrM"

trimmer_software: fastp_pe

bcftools_view_deepsomatic:
  extra: "--exclude 'QUAL<1 | FORMAT/DP<10 | FORMAT/GQ<2 | FORMAT/VAF>0.4 && FORMAT/VAF<0.6 | FORMAT/VAF<0.01 | FORMAT/VAF>0.99' " # remove gnomad of some AF threshold? -T ^/projects/wp3/nobackup/Workspace/mosaic/hardregNgnomad_hg38.bed

bwa_mem:
  container: "docker://hydragenetics/bwa_mem:0.7.17"
  amb: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.amb"
  ann: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.ann"
  bwt: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.bwt"
  pac: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.pac"
  sa: "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta.sa"

deepmosaic_draw:
  container: "library://sanglee8888/deepmosaic/deepmosaic:sha256.28b3dd3b3ab1daa63b53d627c3fba3e6b4da38447ccdb9c7abc073718ff9453e"

deepmosaic_predict:
  container: "library://sanglee8888/deepmosaic/deepmosaic:sha256.28b3dd3b3ab1daa63b53d627c3fba3e6b4da38447ccdb9c7abc073718ff9453e"

deepvariant:
  container: "docker://hydragenetics/deepvariant:1.4.0"
  model_type: "WES"
  output_gvcf: True

deepsomatic_t_only:
  container: "docker://google/deepsomatic:1.8.0"
  extra: "--use_default_pon_filtering=true make_examples_extra_args='vsc_min_fraction_snps=0.029,vsc_min_fraction_indels=0.05' " #--use_default_pon_filtering=true or --pon_filtering="pon.vcf" --process_somatic=true or ""
  model: "WGS_TUMOR_ONLY" # WGS,WES,PACBIO,ONT,FFPE_WGS,FFPE_WES,WGS_TUMOR_ONLY,PACBIO_TUMOR_ONLY,ONT_TUMOR_ONLY

exomedepth_call:
  container: "docker://hydragenetics/exomedepth:1.1.15"
  bedfile: "/projects/wp3/nobackup/TwistCancer/Bedfiles/Twist_Cancer_230706_hg38_TE-98982205-wPGRS.merged_200bpwindows.bed"
  exonsfile: "/projects/wp3/nobackup/TwistCancer/Bedfiles/exons_GRCh38_ensembl109.txt"
  genesfile: "/projects/wp3/nobackup/TwistCancer/Bedfiles/genes_GRCh38_ensembl109.txt"
  genome_version: "hg38"

exomedepth_export:
  container: "docker://hydragenetics/exomedepth:1.1.15"

export_qc_bedtools_intersect:
  extra: " -wb "

export_qc_xlsx_report:
  wanted_transcripts: "/projects/wp3/nobackup/TwistCancer/Bedfiles/wanted_transcripts.txt"

fastp_pe:
  container: "docker://hydragenetics/fastp:0.20.1"

fastqc:
  container: "docker://hydragenetics/fastqc:0.11.9"

melt:
  container: "/projects/wp3/Software/MELTv2.2.2/MELT_v2.2.2.sif"
  bed: "/projects/wp3/Software/MELTv2.2.2/add_bed_files/Hg38/Hg38.genes.bed"
  extra: ""
  mei: "/projects/wp3/Software/MELTv2.2.2/mei_list.txt"

mosaicforecast_input:
  bcftools_filter: "--exclude 'FILTER!=\"PASS\" || FORMAT/DP<8'"

mosaicforecast_phasing:
  container: "docker://yanmei/mosaicforecast_hg38:0.0.1"
  f_format: "bam" # cram/bam
  umap: "/usr/local/bin/k24.umap.wg.bw"

mosaicforecast_readlevel:
  container: "docker://yanmei/mosaicforecast_hg38:0.0.1"
  f_format: "bam" # cram/bam
  umap: "/usr/local/bin/k24.umap.wg.bw"

mosaicforecast_genotype_prediction:
  container: "docker://yanmei/mosaicforecast_hg38:0.0.1"
  model_trained_SNP: "/data/ref_data/wp3/mosaicforecast/150xRFmodel_addRMSK_Refine.rds" # 200xRFmodel_addRMSK_Refine.rds or deletions_250x.RF.rds(phase) or insertions_250x.RF.rds(phase)
  model_trained_INS: "/data/ref_data/wp3/mosaicforecast/insertions_250x.RF.rds" # 200xRFmodel_addRMSK_Refine.rds or deletions_250x.RF.rds(phase) or insertions_250x.RF.rds(phase)
  model_trained_DEL: "/data/ref_data/wp3/mosaicforecast/deletions_250x.RF.rds" # 200xRFmodel_addRMSK_Refine.rds or deletions_250x.RF.rds(phase) or insertions_250x.RF.rds(phase)
  model_type_SNP: "Refine"  # phase/refine
  model_type_INS: "Phase"  # phase/refine
  model_type_DEL: "Phase"  # phase/refine

mosdepth_bed:
  container: "docker://hydragenetics/mosdepth:0.3.2"
  thresholds: "10,20,50"

multiqc:
  container: "docker://hydragenetics/multiqc:1.11"
  reports:
    DNA:
      included_unit_types: ["T", "N"]
      config: "{{PATH_TO_REPO}}/marple_rd_tc/config/multiqc_config.yaml"
      qc_files:
        - "prealignment/fastp_pe/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastp.json"
        - "qc/fastqc/{sample}_{type}_{flowcell}_{lane}_{barcode}_{read}_fastqc.zip"
        - "qc/picard_collect_alignment_summary_metrics/{sample}_{type}.alignment_summary_metrics.txt"
        - "qc/picard_collect_duplication_metrics/{sample}_{type}.duplication_metrics.txt"
        - "qc/picard_collect_gc_bias_metrics/{sample}_{type}.gc_bias.summary_metrics"
        - "qc/picard_collect_hs_metrics/{sample}_{type}.HsMetrics.txt"
        - "qc/picard_collect_insert_size_metrics/{sample}_{type}.insert_size_metrics.txt"
        - "qc/picard_collect_multiple_metrics/{sample}_{type}.{ext}"
        - "qc/samtools_stats/{sample}_{type}.samtools-stats.txt"
        - "qc/mosdepth_bed/{sample}_{type}.mosdepth.summary.txt"
        - "qc/mosdepth_bed/{sample}_{type}.per-base.bed.gz"
        - "qc/mosdepth_bed/{sample}_{type}.mosdepth.region.dist.txt"
        - "qc/mosdepth_bed/{sample}_{type}.mosdepth.global.dist.txt"

picard_collect_alignment_summary_metrics:
  container: "docker://hydragenetics/picard:2.25.4"

picard_collect_duplication_metrics:
  container: "docker://hydragenetics/picard:2.25.4"

picard_collect_gc_bias_metrics:
  container: "docker://hydragenetics/picard:2.25.4"

picard_collect_hs_metrics:
  container: "docker://hydragenetics/picard:2.25.4"

picard_collect_insert_size_metrics:
  container: "docker://hydragenetics/picard:2.25.4"

picard_collect_multiple_metrics:
  container: "docker://hydragenetics/picard:2.25.4"
  output_ext:
    - "gc_bias.pdf" #collect_gc_bias
    - "gc_bias.summary_metrics"
    - "gc_bias.detail_metrics"
    - "alignment_summary_metrics" # collect_alignment_summary
    - "insert_size_metrics" # collect_insert_size
    - "insert_size_histogram.pdf"
    - "error_summary_metrics" # collect_sequencing_artefact
    - "bait_bias_detail_metrics"
    - "bait_bias_summary_metrics"
    - "pre_adapter_detail_metrics"
    - "pre_adapter_summary_metrics"
    - "quality_distribution_metrics" # quality_score_distribution
    - "quality_distribution.pdf"
    - "quality_yield_metrics" # quality yield metrics

picard_mark_duplicates:
  container: "docker://hydragenetics/picard:2.25.4"

vep:
  container: "docker://ensemblorg/ensembl-vep:release_110.1"
  vep_cache: "/data/ref_genomes/VEP"
  extra: "--assembly GRCh38 --check_existing --pick --variant_class --everything"

vt_decompose:
  container: "docker://hydragenetics/vt:2015.11.10"

vt_normalize:
  container: "docker://hydragenetics/vt:2015.11.10"
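Because most of the values above are site-specific paths, a quick scan for paths that do not exist locally can show what still needs adapting. This is a minimal sketch, not part of the pipeline: the function name is hypothetical, it only looks at string values that start with "/", and it will also report paths that are meant to exist inside a container rather than on the host.

```python
from pathlib import Path


def missing_paths(node, prefix=""):
    """Yield (dotted_key, value) for absolute-path strings that do not exist locally."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from missing_paths(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from missing_paths(value, f"{prefix}{index}.")
    elif isinstance(node, str) and node.startswith("/") and not Path(node).exists():
        yield prefix.rstrip("."), node


if __name__ == "__main__":
    # Small excerpt of the config above, written as a Python dict.
    config = {
        "reference": {
            "fasta": "/data/ref_genomes/GRCh38/reference_grasnatter/homo_sapiens.fasta",
            "skip_chrs": ["chrX", "chrY", "chrM"],
        },
        "vep": {"vep_cache": "/data/ref_genomes/VEP"},
    }
    for key, value in missing_paths(config):
        print(f"{key}: {value} not found")
```

If PyYAML is installed, the full file could be loaded with yaml.safe_load and passed to the same function. Values that start with {{PATH_TO_REPO}} and container URIs such as docker://... are skipped by this check.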

Resources

A resources.yaml file can also be found in the config/ folder. It is adapted to the compute cluster at Clinical Genomics Uppsala, but can be used as an indication of the resources needed by the different programs.

Current resources.yaml:
default_resources:
  threads: 1
  time: "12:00:00"
  mem_mb: 6144
  mem_per_cpu: 6144
  partition: "core"

bwa_mem:
  mem_mb: 122880
  mem_per_cpu: 6144
  threads: 8

deepvariant:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10
  time: "12:00:00"

deepsomatic_t_only:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10
  time: "12:00:00"

fastp_pe:
  threads: 1
  mem_mb: 6144
  mem_per_cpu: 6144

fastqc:
  threads: 2
  mem_mb: 12288
  mem_per_cpu: 6144

mosdepth_bed:
  mem_mb: 36864
  threads: 4

samtools_sort:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10

samtools_view:
  mem_mb: 61440
  mem_per_cpu: 6144
  threads: 10

vep:
  threads: 5
  mem_mb: 30720
  mem_per_cpu: 6144
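When adapting these values to another cluster, it can be useful to check that mem_mb, mem_per_cpu and threads agree with each other. The sketch below is only an arithmetic check with a hypothetical function name; which of the memory values is actually used depends on the Snakemake profile, so a mismatch is not necessarily an error.

```python
def inconsistent_rules(resources: dict) -> list:
    """Return rule names where mem_mb != threads * mem_per_cpu."""
    flagged = []
    for rule, values in resources.items():
        keys = ("mem_mb", "mem_per_cpu", "threads")
        if not all(key in values for key in keys):
            continue  # skip rules that do not define all three values
        if values["mem_mb"] != values["threads"] * values["mem_per_cpu"]:
            flagged.append(rule)
    return flagged


if __name__ == "__main__":
    # Values copied from the resources.yaml above (subset).
    resources = {
        "bwa_mem": {"mem_mb": 122880, "mem_per_cpu": 6144, "threads": 8},
        "deepvariant": {"mem_mb": 61440, "mem_per_cpu": 6144, "threads": 10},
        "fastqc": {"mem_mb": 12288, "mem_per_cpu": 6144, "threads": 2},
        "vep": {"mem_mb": 30720, "mem_per_cpu": 6144, "threads": 5},
    }
    print(inconsistent_rules(resources))  # ['bwa_mem']
```

For bwa_mem, 8 threads at 6144 MB per CPU give 49152 MB, while mem_mb is set to 122880, so that entry is worth a second look when porting the file.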

🚀 Run command

# Activate the virtual environment
source virtual/environment/bin/activate

# Run snakemake, passing sequenceid and PATH_TO_REPO as extra config parameters
snakemake --profile snakemakeprofile --configfile config.yaml -s /path/to/marple/workflow/Snakefile --config sequenceid="230202-test" PATH_TO_REPO=/path/to/repo/
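For scripted runs, the same invocation can be assembled from Python. This is a sketch under assumptions: the helper name is hypothetical, the paths are the placeholders from the command above, and both config values are passed as KEY=VALUE pairs after a single --config flag, which Snakemake accepts.

```python
import shlex


def build_command(sequenceid: str, repo_path: str, snakefile: str) -> list:
    """Assemble the snakemake argument list used in the run command."""
    return [
        "snakemake",
        "--profile", "snakemakeprofile",
        "--configfile", "config.yaml",
        "-s", snakefile,
        # Both key=value pairs follow one --config flag.
        "--config", f"sequenceid={sequenceid}", f"PATH_TO_REPO={repo_path}",
    ]


if __name__ == "__main__":
    command = build_command(
        "230202-test", "/path/to/repo/", "/path/to/marple/workflow/Snakefile"
    )
    # Print the command for inspection instead of running it.
    print(shlex.join(command))
    # To launch it: subprocess.run(command, check=True)
```

Printing the command first makes it easy to inspect before launching a long run.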