Steps in Marple 🕵️‍♀️

To go into details of the pipeline we dived the pipeline into modules similar to Hydra-Genetics module system. Default hydra-genetics settings/resources are used if no configuration is specified.


Prealignment

See the Prealignment hydra-genetics module documentation on ReadTheDoc or github documentation for more details on the softwares.

dag plot

Pipeline output files

Only temporary intermediate files are created.

Trimming

Trimming of fastq files is performed by fastp v0.20.1.

Merging

Merging of fastq files belonging to the same sample are performed by simply concatenating the files with cat if applicable.


Alignment

See Alignment hydra-genetics module on ReadTheDocs or github for more information and documentation on the softwares used in the alignment steps.

dag plot

Pipeline output files:

  • Results/{sample}/{sample}.cram
  • Results/{sample}/{sample}.cram.crai

Alignment with BWA-mem

Alignment of fastq files into bam files is performed by bwa-mem v0.7.17 using the trimmed fastq files. This make it possible to speed up alignment by utilizing parallelization and also make it possible to analyze qc for lanes separately. Bamfiles are then directly sorted by samtools sort v1.15.

Read groups

Bam file read groups are set according to sequencing information in the units.tsv file. The @RG read tag is set using the following options defined in the hydra-genetics bwa rule:

-R '@RG\tID:{ID}\tSM:{SM}\tPL:{PL}\tPU:{PU}\tLB:{LB}' -v 1

where the individual read groups are defined below:

RG tag Value
ID sample_type.lane.barcode
SM sample_type
PL platform
PU flowcell.lane.barcode
LB sample_type

Bam splitting

The bam files are split into chromosome files for faster performance in downstream analysis. Split files are used by markduplicates. Splitting is performed by samtools view v1.15.

Mark duplicates

Flagging duplicated reads are performed on individual chromosome bam files by picard MarkDuplicates v2.25.4.

Merging

Merging of deduplicated bam files belonging to the same sample are performed by samtools merge v1.15.

Sorting

Merged bamfile are sorted by samtools sort v1.15.

Bam indexing

Bamfile indexing is performed by samtools index v1.15.


SNV indels

SNV and indels are called using the SNV_indels (ReadTheDoc or github) module. Annotation is then done with Annotation module (ReadTheDocs or github).

dag plot

Pipeline output files

  • Results/{sample}/{sample}.hard-filtered.vcf.gz
  • Results/{sample}/{sample}.genome.vcf.gz
  • Results/{sample}/mosaic/{sample}.deepmosaic.txt
  • Results/{sample}/mosaic/{sample}.deepsomatic.vcf.gz
  • Results/{sample}/mosaic/{sample}.mosaicforecast.phasing
  • Results/{sample}/mosaic/{sample}.mosaicforecast.DEL.predictions
  • Results/{sample}/mosaic/{sample}.mosaicforecast.INS.predictions
  • Results/{sample}/mosaic/{sample}.mosaicforecast.SNP.predictions

SNV calling

Variants are called using DeepVariant. Deepvariant is run per chromosome over the regions defined in config["references"]["design_bed"]. The model type is set to "WES" and --output_gvcf is used to ensure that both genome vcf as well as a standard vcf is produced. The AF field is added to the INFO column in the vcf:s using the fix_af.py. The vcf header in the standard vcf is also updated to include a reference line using the add_ref_to_vcf.py to ensure that programs such as Alissa acknowledge the use of Hg38.

Normalizing

The standard vcf files is decomposed with vt decompose followed by vt decompose_blocksub v2015.11.10. The decomposed vcf files are then normalized by vt normalize v2015.11.10.

Annotation

The normalized standard VCF files are then annotated using VEP v109.3. Vep is run with the extra parameters --assembly GRCh38 --check_existing --pick --variant_class --everything.

See the annotation hydra-genetics module for additional information.

Mosaic variant calling

Possible mosaic variants is called by DeepSomatic which are then predicted to be real or not with MosaicForecast and DeepMosaic.


CNV

CNVs are called using the Hydra-Genetics CNV_SV module (ReadTheDocs or github).

dag plot

Pipeline output files

  • Results/{sample}_{sequenceid}/{sample}_{sequenceid}_exomedepth_SV.txt
  • Results/{sample}_{sequenceid}/{sample}_{sequenceid}_exomedepth.aed
  • Results/{sample}/{sample}.cnv.vcf.gz
  • Results/{sample}/mobile_elements/{sample}.ALU.vcf.gz
  • Results/{sample}/mobile_elements/{sample}.LINE1.vcf.gz
  • Results/{sample}/mobile_elements/{sample}.HERVK.vcf.gz
  • Results/{sample}/mobile_elements/{sample}.SVA.vcf.gz

Exomedepth

To call larger structural variants Exomedepth v1.1.15 is used. Exomedepth does not use a window approach but evaluates each row in the bedfile as a segment, therefor the bedfile need to be split into appropriate large windows (e.g. using bedtools makewindows). Exomedepth also need a RData file containing the normal pool, this can be created using the Marple - references workflow. Lines with no-change calls (reads.ratio == 1) are removed from the output for Alissa compatibility. Since no sex-chromosome are included in the design Exomedepth is run with the same normalpool irregardless of the sample's biological sex. Marple is designed on HG38 therefor a genes and exonfile are also needed for annotation.

Mobile elements

To find mobile elements, like ALU, HERVK, LINE1 and SVA, MELT is used. It is free to use for academic purposes and you can buy a licence if used in other cases, like for this pipeline as part of the clinical workflow at the hospital. Since you should ask to use MELT, the singularity we use are local.


QC

For quality control the QC module (ReadTheDocs or github) is used and the results are summarized/aggregated into a MultiQC-report.


dag plot

Pipeline output files

  • Results/{sequenceid}_MultiQC.html

MultiQC

A MultiQC html report is generated using MultiQC v1.11. The report starts with a general statistics table showing the most important QC-values followed by additional QC data and diagrams. See result files page for more in-depth info on general stats columns. The qc data is collected and generated using Fastp, FastQC, samtools, picard and Mosdepth.

The samples in the MultiQC report is renamed and ordered based on the order of the SampleSheet.csv to enable easier integration with the wet lab follow-up.

The report is configured based on a MultiQC config file.

Expand to view current MultiQC config.yaml
title: "Clinical Genomics MultiQC Report"
subtitle: "Twist Cancer"
intro_text: "The MultiQC report summarize analysis results from Twist Cancer data that been analyzed by the pipeline marple_rd_tc (https://github.com/clinical-genomics-uppsala/marple-rd-tc). Reference used: GRCh38."


report_header_info:
  - Contact E-mail: "igp-klinsek-bioinfo@lists.uu.se"
  - Application Type: "Twist Cancer Panel"
  - Project Type: "Clinical Samples"
  # - Sequencing Platform: "HiSeq 2500 High Output V4"
  # - Sequencing Setup: "2x150"

decimalPoint_format: ','

## maste anpassa configen lite mera. 20x breadth, insert size och bases on target. Fold80?
extra_fn_clean_exts: ##from this until end
    - '.duplication_metrics'
    - '_N'

custom content:
  order:
    - fastqc
    - mosdepth
    - fastp
    - peddy
    - samtools
    - picard

mosdepth_config:
  include_contigs:
    - "chr1"
    - "chr2"
    - "chr3"
    - "chr4"
    - "chr5"
    - "chr6"
    - "chr7"
    - "chr8"
    - "chr9"
    - "chr10"
    - "chr11"
    - "chr12"
    - "chr13"
    - "chr14"
    - "chr15"
    - "chr16"
    - "chr17"
    - "chr18"
    - "chr19"
    - "chr20"
    - "chr21"
    - "chr22"

read_count_multiplier: 0.001
read_count_prefix: "K"
read_count_desc: "thousands"


table_columns_visible:
  FastQC:
    percent_duplicates: False
    percent_gc: False
    avg_sequence_length: False
    percent_fails: False
    total_sequences: False
  fastp:
    pct_adapter: True
    pct_surviving: False
    after_filtering_gc_content: False
    filtering_result_passed_filter_reads: False
    after_filtering_q30_bases: False
    after_filtering_q30_rate: False
    pct_duplication: False
  mosdepth:
    median_coverage: True
    mean_coverage: False
    1_x_pc: False
    5_x_pc: False
    10_x_pc: False
    20_x_pc: False
    30_x_pc: True
    50_x_pc: True
  Picard:
    PCT_PF_READS_ALIGNED: False
    summed_median: False
    summed_mean: True
    PERCENT_DUPLICATION: True
    MEDIAN_COVERAGE: False
    MEAN_COVERAGE: False
    SD_COVERAGE: False
    PCT_30X: False
    PCT_TARGET_BASES_30X: False
    FOLD_ENRICHMENT: False
    TOTAL_READS: False
  Samtools:
    error_rate: False
    non-primary_alignments: False
    reads_mapped: False
    reads_mapped_percent: True
    reads_properly_paired_percent: True
    reads_MQ0_percent: False
    raw_total_sequences: True #only on bedfile not total of fastq, bases on target only

# Patriks plug in, addera egna columner till general stats
multiqc_cgs:
  Picard:
    FOLD_80_BASE_PENALTY:
      title: "Fold80"
      description: "Fold80 penalty from picard hs metrics"
      min: 1
      max: 3
      scale: "RdYlGn-rev"
      format: "{:.1f}"
    PCT_SELECTED_BASES:
      title: "Bases on Target"
      description: "On+Near Bait Bases / PF Bases Aligned from Picard HsMetrics"
      format: "{:.2%}"
    ZERO_CVG_TARGETS_PCT:
      title: "Target bases with zero coverage [%]"
      description: "Target bases with zero coverage [%] from Picard"
      min: 0
      max: 100
      scale: "RdYlGn-rev"
      format: "{:.2%}"
  Samtools:
    average_quality:
      title: "Average Quality"
      description: "Ratio between the sum of base qualities and total length from Samtools stats"
      min: 0
      max: 60
      scale: "RdYlGn"
  mosdepth:
     20_x_pc: #Cant get it to work
        title: "20x percent"
        description: "Fraction of genome with at least 20X coverage"
        max: 100
        min: 0
        suffix: "%"
        scale: "RdYlGn"

# Galler alla kolumner oberoende pa module!
table_columns_placement:
  mosdepth:
    median_coverage: 601
    1_x_pc: 666
    5_x_pc: 666
    10_x_pc: 602
    20_x_pc: 603
    30_x_pc: 604
    50_x_pc: 605
  Samtools:
    raw_total_sequences: 500
    reads_mapped: 501
    reads_mapped_percent: 502
    reads_properly_paired_percent: 503
    average_quality: 504
    error_rate: 555
    reads_MQ0_percent: 555
    non-primary_alignments: 555
  Picard:
    TOTAL_READS: 500
    PCT_SELECTED_BASES: 801
    FOLD_80_BASE_PENALTY: 802
    PCT_PF_READS_ALIGNED: 888
    summed_median: 888
    PERCENT_DUPLICATION: 803
    summed_mean: 804
    STANDARD_DEVIATION: 805
    ZERO_CVG_TARGETS_PCT: 888
    MEDIAN_COVERAGE: 888
    MEAN_COVERAGE: 888
    SD_COVERAGE: 888
    PCT_30X: 888
    PCT_TARGET_BASES_30X: 888
    FOLD_ENRICHMENT: 888

FastQC

FastQC v0.11.9 is run on the raw fastq-files before trimming.

Mosdepth

Mosdepth v0.3.2 is used together with a bedfile covering all coding exons (config[reference][exon_bed]) and thresholds (10,20,50) to calculate coverage.

Samtools

Samtools stats v1.15 is run on BWA-mem aligned and merged bam files without any designfile.

Picard

Picard v2.25.4 is run on BWA-mem aligned and merged bam files collecting a number of metrics. The metrics calculated are listed below:

  • picard CollectAlignmentSummaryMetrics - using fasta reference genome file
  • picard CollectDuplicationMetrics
  • picard CollectGCBiasMetrics
  • picard CollectHsMetrics - using fasta reference genome file, the full capture design bedfile (config[reference][design_bed]), and with the option COVERAGE_CAP=5000
  • picard CollectInsertSizeMetrics
  • picard CollectMultipleMetrics - using fasta reference genome file, and the full bedfile (config[reference][design_intervals])