Result files

Marple produces many intermediate and result files, but only the files defined in output_files.yaml are kept. The rest are temporary and are deleted once no subsequent rule needs them. To keep files other than the predefined ones, edit output_files.yaml or add --no-temp to the run command.

Files

The following output files are located in the Results/ folder:

| File | Format | Description |
|---|---|---|
| multiqc_DNA.html | html | Aggregated QC values for the entire sequencing run, open in a browser |
| {sample}/{sample}.xlsx | xlsx | Excel file with QC stats (primarily coverage) for each sample |
| {sample}/{sample}_N.cram | cram | Deduplicated alignment file |
| {sample}/{sample}_N.cram.crai | crai | Index for alignment file |
| {sample}/{sample}.hard-filtered.vcf.gz | vcf.gz | Compressed VCF file, decomposed, normalized and annotated with VEP |
| {sample}/{sample}.hard-filtered.vcf.gz.tbi | tbi | Index for variant file |
| {sample}/{sample}.genome.vcf.gz | genome.vcf.gz | Compressed VCF file for all positions in the design, neither decomposed nor normalized |
| {sample}/{sample}.genome.vcf.gz.tbi | tbi | Index for genome VCF file |
| {sample}/{sample}_exomedepth_SV.txt | txt | Nexus SV text file with structural variants from ExomeDepth |
| {sample}/{sample}_exomedepth.aed | aed | AED text file with structural variants from ExomeDepth |
| {sample}/{sample}.cnv.vcf.gz | vcf.gz | Compressed VCF file with structural variants from ExomeDepth |
| {sample}/{sample}.cnv.vcf.gz.tbi | tbi | Index for variant file from ExomeDepth |
| {sample}/mobile_elements/{sample}.ALU.vcf.gz | vcf.gz | Compressed VCF file with predicted ALU elements |
| {sample}/mobile_elements/{sample}.LINE1.vcf.gz | vcf.gz | Compressed VCF file with predicted LINE1 elements |
| {sample}/mobile_elements/{sample}.HERVK.vcf.gz | vcf.gz | Compressed VCF file with predicted HERVK elements |
| {sample}/mobile_elements/{sample}.SVA.vcf.gz | vcf.gz | Compressed VCF file with predicted SVA elements |
| {sample}/mosaic/{sample}.deepmosaic.txt | tsv | Candidate variants and their predictions from DeepMosaic |
| {sample}/mosaic/{sample}.deepsomatic.vcf.gz | vcf.gz | Compressed VCF file from DeepSomatic where PASS variants are possible mosaic variants |
| {sample}/mosaic/{sample}.deepsomatic.vcf.gz.tbi | tbi | Index for DeepSomatic VCF file |
| {sample}/mosaic/{sample}.mosaicforecast.phasing | tsv | Candidate mosaic variants based on phasing from MosaicForecast |
| {sample}/mosaic/{sample}.mosaicforecast.DEL.predictions | tsv | Candidate deletion variants and their predictions from MosaicForecast |
| {sample}/mosaic/{sample}.mosaicforecast.INS.predictions | tsv | Candidate insertion variants and their predictions from MosaicForecast |
| {sample}/mosaic/{sample}.mosaicforecast.SNP.predictions | tsv | Candidate SNP variants and their predictions from MosaicForecast |
| {sequenceid}_config.yaml | yaml | YAML config file with program versions and extra settings used |
| {sequenceid}_config_exomedepth.yaml | yaml | YAML config file stating which reference was used for ExomeDepth |
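
The path templates in the table can be expanded per sample to check that a run delivered what is expected. The following is only a minimal sketch, not part of Marple: the helper names, the sample name NA12878 and the short template list are assumptions made for illustration, and only a handful of the files above are included.

```python
from pathlib import Path

# A few per-sample path templates copied from the table above (not the full list).
SAMPLE_TEMPLATES = [
    "{sample}/{sample}.xlsx",
    "{sample}/{sample}_N.cram",
    "{sample}/{sample}_N.cram.crai",
    "{sample}/{sample}.hard-filtered.vcf.gz",
    "{sample}/{sample}.hard-filtered.vcf.gz.tbi",
]


def expected_files(sample, results_dir="Results"):
    """Expand every template for one sample, relative to the results folder."""
    return [Path(results_dir) / t.format(sample=sample) for t in SAMPLE_TEMPLATES]


def missing_files(sample, results_dir="Results"):
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_files(sample, results_dir) if not p.exists()]


if __name__ == "__main__":
    # "NA12878" is a hypothetical sample name used only for this example.
    for path in missing_files("NA12878"):
        print(f"missing: {path}")
```

Extending SAMPLE_TEMPLATES with the remaining rows of the table would cover the full set of result files.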

MultiQC report

Marple produces a MultiQC-report for the entire sequencing run to enable easier QC tracking. The report starts with a general statistics table showing the most important QC-values followed by additional QC data and diagrams. The entire MultiQC html-file is interactive and you can filter, highlight, hide or export data using the ToolBox at the right edge of the report.


The report is configured based on a MultiQC config file.

Current MultiQC config.yaml:
title: "Clinical Genomics MultiQC Report"
subtitle: "Twist Cancer"
intro_text: "The MultiQC report summarize analysis results from Twist Cancer data that been analyzed by the pipeline marple_rd_tc (https://github.com/clinical-genomics-uppsala/marple-rd-tc). Reference used: GRCh38."


report_header_info:
  - Contact E-mail: "igp-klinsek-bioinfo@lists.uu.se"
  - Application Type: "Twist Cancer Panel"
  - Project Type: "Clinical Samples"
  # - Sequencing Platform: "HiSeq 2500 High Output V4"
  # - Sequencing Setup: "2x150"

decimalPoint_format: ','

## need to adapt the config a bit more: 20x breadth, insert size and bases on target. Fold80?
extra_fn_clean_exts: ##from this until end
    - '.duplication_metrics'
    - '_N'

custom content:
  order:
    - fastqc
    - mosdepth
    - fastp
    - peddy
    - samtools
    - picard

mosdepth_config:
  include_contigs:
    - "chr1"
    - "chr2"
    - "chr3"
    - "chr4"
    - "chr5"
    - "chr6"
    - "chr7"
    - "chr8"
    - "chr9"
    - "chr10"
    - "chr11"
    - "chr12"
    - "chr13"
    - "chr14"
    - "chr15"
    - "chr16"
    - "chr17"
    - "chr18"
    - "chr19"
    - "chr20"
    - "chr21"
    - "chr22"

read_count_multiplier: 0.001
read_count_prefix: "K"
read_count_desc: "thousands"


table_columns_visible:
  FastQC:
    percent_duplicates: False
    percent_gc: False
    avg_sequence_length: False
    percent_fails: False
    total_sequences: False
  fastp:
    pct_adapter: True
    pct_surviving: False
    after_filtering_gc_content: False
    filtering_result_passed_filter_reads: False
    after_filtering_q30_bases: False
    after_filtering_q30_rate: False
    pct_duplication: False
  mosdepth:
    median_coverage: True
    mean_coverage: False
    1_x_pc: False
    5_x_pc: False
    10_x_pc: False
    20_x_pc: False
    30_x_pc: True
    50_x_pc: True
  Picard:
    PCT_PF_READS_ALIGNED: False
    summed_median: False
    summed_mean: True
    PERCENT_DUPLICATION: True
    MEDIAN_COVERAGE: False
    MEAN_COVERAGE: False
    SD_COVERAGE: False
    PCT_30X: False
    PCT_TARGET_BASES_30X: False
    FOLD_ENRICHMENT: False
    TOTAL_READS: False
  Samtools:
    error_rate: False
    non-primary_alignments: False
    reads_mapped: False
    reads_mapped_percent: True
    reads_properly_paired_percent: True
    reads_MQ0_percent: False
    raw_total_sequences: True #only on bedfile not total of fastq, bases on target only

# Patrik's plugin: add custom columns to general stats
multiqc_cgs:
  Picard:
    FOLD_80_BASE_PENALTY:
      title: "Fold80"
      description: "Fold80 penalty from picard hs metrics"
      min: 1
      max: 3
      scale: "RdYlGn-rev"
      format: "{:.1f}"
    PCT_SELECTED_BASES:
      title: "Bases on Target"
      description: "On+Near Bait Bases / PF Bases Aligned from Picard HsMetrics"
      format: "{:.2%}"
    ZERO_CVG_TARGETS_PCT:
      title: "Target bases with zero coverage [%]"
      description: "Target bases with zero coverage [%] from Picard"
      min: 0
      max: 100
      scale: "RdYlGn-rev"
      format: "{:.2%}"
  Samtools:
    average_quality:
      title: "Average Quality"
      description: "Ratio between the sum of base qualities and total length from Samtools stats"
      min: 0
      max: 60
      scale: "RdYlGn"
  mosdepth:
    20_x_pc: # Can't get it to work
      title: "20x percent"
      description: "Fraction of genome with at least 20X coverage"
      max: 100
      min: 0
      suffix: "%"
      scale: "RdYlGn"

# Applies to all columns regardless of module!
table_columns_placement:
  mosdepth:
    median_coverage: 601
    1_x_pc: 666
    5_x_pc: 666
    10_x_pc: 602
    20_x_pc: 603
    30_x_pc: 604
    50_x_pc: 605
  Samtools:
    raw_total_sequences: 500
    reads_mapped: 501
    reads_mapped_percent: 502
    reads_properly_paired_percent: 503
    average_quality: 504
    error_rate: 555
    reads_MQ0_percent: 555
    non-primary_alignments: 555
  Picard:
    TOTAL_READS: 500
    PCT_SELECTED_BASES: 801
    FOLD_80_BASE_PENALTY: 802
    PCT_PF_READS_ALIGNED: 888
    summed_median: 888
    PERCENT_DUPLICATION: 803
    summed_mean: 804
    STANDARD_DEVIATION: 805
    ZERO_CVG_TARGETS_PCT: 888
    MEDIAN_COVERAGE: 888
    MEAN_COVERAGE: 888
    SD_COVERAGE: 888
    PCT_30X: 888
    PCT_TARGET_BASES_30X: 888
    FOLD_ENRICHMENT: 888
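
The table_columns_placement values in the config can be read as sort keys that apply across all modules. The sketch below assumes that MultiQC places columns with lower values further to the left, which agrees with the column order documented for the General Statistics table. It uses a hand-copied subset of the config, and the function name column_order is hypothetical.

```python
# Subset of table_columns_placement from the config above, keyed by module.
# The modules are deliberately listed out of order to show that the weights,
# not the module grouping, decide the final position.
placement = {
    "Picard": {"FOLD_80_BASE_PENALTY": 802, "PCT_SELECTED_BASES": 801},
    "mosdepth": {"30_x_pc": 604, "median_coverage": 601},
    "Samtools": {"reads_mapped_percent": 502, "raw_total_sequences": 500},
}


def column_order(placement):
    """Flatten module -> column -> weight and sort by weight (assumed: lower = further left)."""
    columns = [
        (weight, module, column)
        for module, cols in placement.items()
        for column, weight in cols.items()
    ]
    return [f"{module}:{column}" for weight, module, column in sorted(columns, key=lambda c: c[0])]


if __name__ == "__main__":
    for name in column_order(placement):
        print(name)
```

Under that assumption the Samtools columns (5xx) come first, followed by mosdepth (6xx) and Picard (8xx).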

General Statistics

The general statistics table is ordered by the fastq-file "S"-index, e.g. sampleT_S1_R1_001.fastq.gz comes before sampleA_S2_R1_001.fastq.gz. This is done by renaming the samples in two steps using the script sample_order_multiqc.py. To toggle between "Sample Order" and "Sample Name", use the buttons just above the General Stats header.
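
The ordering idea can be illustrated with a short sketch. This is not the actual sample_order_multiqc.py script: the regular expression, the function name s_index and the third file name (sampleB) are assumptions made for the example.

```python
import re

# Matches the "S"-index in Illumina-style fastq names such as sampleT_S1_R1_001.fastq.gz.
S_INDEX = re.compile(r"_S(\d+)_R[12]_001\.fastq\.gz$")


def s_index(fastq_name):
    """Return the numeric "S"-index of a fastq file name."""
    match = S_INDEX.search(fastq_name)
    if match is None:
        raise ValueError(f"no S-index found in {fastq_name!r}")
    return int(match.group(1))


fastqs = [
    "sampleA_S2_R1_001.fastq.gz",
    "sampleT_S1_R1_001.fastq.gz",
    "sampleB_S10_R1_001.fastq.gz",  # hypothetical extra sample
]

# Sort numerically so that S10 comes after S2, which a plain string sort would not do.
ordered = sorted(fastqs, key=s_index)

if __name__ == "__main__":
    print(ordered)
```

Sorting on the integer value rather than on the file name is what puts sampleT ahead of sampleA here.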


| Column Name | Origin | Comment |
|---|---|---|
| K Reads | Samtools stats | Total number of reads in inputfile (alignment/samtools_merge_bam/{sample}_{type}.bam) |
| % Mapped | Samtools stats | Percent reads mapped, anywhere in the reference (no design file used) |
| % Proper pairs | Samtools stats | Only reads on target (config[reference][design_bed]) |
| Average Quality | Samtools stats | Ratio between sum of base quality over total length. Only reads on target (config[reference][design_bed]) |
| Median | Mosdepth | Median Coverage over coding exon in design (config[reference][exon_bed]) |
| >= 30X | Mosdepth | Fraction of coding exons (config[reference][exon_bed]) with coverage over 30x |
| >= 50X | Mosdepth | Fraction of coding exons (config[reference][exon_bed]) with coverage over 50x |
| Bases on Target | Picard HSMetrics | Bases inside the capture design (config[reference][design_intervals]) |
| Fold80 | Picard HSMetrics | The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets (config[reference][design_intervals]) |
| % Dups | Picard DuplicationMetrics | |
| Mean Insert Size | Picard InsertSizeMetrics | |
| Target Bases with zero coverage [%] | Picard HSMetrics | Percent target (config[reference][design_intervals]) bases with 0 coverage |
| % Adapter | fastp | |