Pipeline-specific rules and software used in Marple

Rules created specifically for the Marple pipeline are listed here.

add_ref_to_vcf.smk

A Python script that adds a line with the reference path to the header of VCF files, so that Alissa knows which genome build was used. The ##reference= line needs to contain either hg38 or GRCh38 for Alissa to understand that the reference is not hg19.
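The header edit described above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of add_ref_to_vcf.py: the function name `add_reference_header`, the example path and the choice to drop any pre-existing ##reference= line are assumptions made for the example.

```python
import io


def add_reference_header(lines, reference_path):
    """Yield VCF lines with one ##reference= line placed just before #CHROM."""
    for line in lines:
        # Assumption: drop an existing ##reference= line so only one remains.
        if line.startswith("##reference="):
            continue
        if line.startswith("#CHROM"):
            # The path should contain hg38 or GRCh38 so Alissa picks the right build.
            yield f"##reference={reference_path}\n"
        yield line


# Tiny in-memory VCF for demonstration; the rule itself reads a .vcf.gz file.
example_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
)
result = list(add_reference_header(example_vcf, "/path/to/GRCh38.fasta"))
print("".join(result), end="")
```

For a compressed input, the same generator could be fed from `gzip.open(path, "rt")` from the standard library.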

🐍 Rule

rule add_ref_to_vcf:
    input:
        vcf="snv_indels/deepvariant/{sample}_N.normalized.sorted.vep_annotated.vcf.gz",
        ref=config["reference"]["fasta"],
    output:
        vcf=temp("snv_indels/deepvariant/{sample}_N.normalized.sorted.vep_annotated.ref.vcf"),
    log:
        "snv_indels/deepvariant/{sample}_N.normalized.sorted.vep_annotated.ref.vcf.log",
    benchmark:
        repeat(
            "snv_indels/deepvariant/{sample}_N.normalized.sorted.vep_annotated.ref.vcf.benchmark.tsv",
            config.get("add_ref_to_vcf", {}).get("benchmark_repeats", 1),
        )
    resources:
        mem_mb=config.get("add_ref_to_vcf", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("add_ref_to_vcf", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("add_ref_to_vcf", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("add_ref_to_vcf", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("add_ref_to_vcf", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("add_ref_to_vcf", {}).get("container", config["default_container"])
    message:
        "{rule}: Add reference to the header of the deepvariant vcf: {input.vcf}"
    script:
        "../scripts/add_ref_to_vcf.py"

↔ input / output files

Rule parameters Key Value Description
input vcf "snv_indels/deepvariant/{sample}_N.normalized.sorted.vep_annotated.vcf.gz" final vcf where reference should be added to vcf-header
ref config["reference"]["fasta"] fasta reference used
output vcf "snv_indels/deepvariant/{sample}_N.normalized.sorted.vep_annotated.ref.vcf" final vcf with reference genome in vcf-header

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time
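Each setting in the rule is read with `config.get("add_ref_to_vcf", {}).get(key, default)`, so a missing rule section or a missing key falls back to `default_resources` (or to 1 for `benchmark_repeats`). The sketch below reproduces that lookup with plain dictionaries; the helper name `rule_setting` and all values are made up for illustration.

```python
def rule_setting(config, rule, key, default=None):
    """Return config[rule][key] if present, otherwise the supplied default."""
    return config.get(rule, {}).get(key, default)


# Hypothetical merged configuration; the values are examples only.
config = {
    "default_resources": {"mem_mb": 6144, "threads": 1, "time": "4:00:00"},
    "add_ref_to_vcf": {"mem_mb": 2048},
}
defaults = config["default_resources"]

# Overridden for this rule.
mem_mb = rule_setting(config, "add_ref_to_vcf", "mem_mb", defaults["mem_mb"])
# Not set for this rule, so the pipeline-wide default applies.
max_time = rule_setting(config, "add_ref_to_vcf", "time", defaults["time"])
# No default_resources entry exists for this key; the rule hard-codes 1.
benchmark_repeats = rule_setting(config, "add_ref_to_vcf", "benchmark_repeats", 1)

print(mem_mb, max_time, benchmark_repeats)
```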

exomedepth_export.smk

An R script that creates output files from exomedepth results.

🐍 Rule

rule exomedepth_export:
    input:
        exon="cnv_sv/exomedepth_call/{sample}_{type}.RData",
    output:
        aed=temp("cnv_sv/exomedepth_call/{sample}_{type}.aed"),
        nexus_sv=temp("cnv_sv/exomedepth_call/{sample}_{type}_SV.txt"),
    params:
        extra=config.get("exomedepth_export", {}).get("extra", ""),
    log:
        "cnv_sv/exomedepth_call/{sample}_{type}_SV.txt.log",
    benchmark:
        repeat(
            "cnv_sv/exomedepth_call/{sample}_{type}_SV.txt.benchmark.tsv",
            config.get("exomedepth_export", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("exomedepth_export", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("exomedepth_export", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("exomedepth_export", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("exomedepth_export", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("exomedepth_export", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("exomedepth_export", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("exomedepth_export", {}).get("container", config["default_container"])
    message:
        "{rule}: Export exomedepth CNV results from {input.exon} "
    script:
        "../scripts/exomedepth_export.R"

↔ input / output files

Rule parameters Key Value Description
input exon "cnv_sv/exomedepth_call/{sample}_{type}.RData" RData file from exomedepth call
output aed "cnv_sv/exomedepth_call/{sample}_{type}.aed" calls from exomedepth in aed format
nexus_sv "cnv_sv/exomedepth_call/{sample}_{type}_SV.txt" nexus SV txt file with exomedepth calls

🔧 Configuration

Software settings (config.yaml)

Key Type Description
container string name or path to docker/singularity container

Resources settings (resources.yaml)

Key Type Description
threads integer number of threads that will be used by exomedepth_export
time string max execution time for exomedepth_export
mem_mb integer memory used for exomedepth_export
mem_per_cpu integer memory used per cpu for exomedepth_export
partition string partition to use on the cluster for exomedepth_export

export_qc.smk

Rules that create an .xlsx file per sample with aggregated coverage information.

🐍 Rule

rule export_qc_bedtools_intersect:
    input:
        left="qc/mosdepth_bed/{sample}_{type}.per-base.bed.gz",
        coverage_csi="qc/mosdepth_bed/{sample}_{type}.per-base.bed.gz.csi",
        right=config["reference"]["exon_bed"],
    output:
        results=temp("qc/mosdepth_bed/{sample}_{type}.mosdepth.per-base.exon_bed.txt"),
    params:
        extra=config.get("export_qc_bedtools_intersect", {}).get("extra", ""),
    log:
        "qc/mosdepth_bed/{sample}_{type}.mosdepth.per-base.exon_bed.log",
    benchmark:
        repeat(
            "qc/mosdepth_bed/{sample}_{type}.mosdepth.per-base.exon_bed.benchmark.tsv",
            config.get("export_qc_bedtools_intersect", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("export_qc_bedtools_intersect", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("export_qc_bedtools_intersect", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("export_qc_bedtools_intersect", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("export_qc_bedtools_intersect", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("export_qc_bedtools_intersect", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("export_qc_bedtools_intersect", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("export_qc_bedtools_intersect", {}).get("container", config["default_container"])
    message:
        "{rule}: export low cov regions from {input.left} based on {input.right}"
    wrapper:
        "v1.32.0/bio/bedtools/intersect"

↔ input / output files

Rule parameters Key Value Description
input left "qc/mosdepth_bed/{sample}_{type}.per-base.bed.gz" per-base coverage file from mosdepth
coverage_csi "qc/mosdepth_bed/{sample}_{type}.per-base.bed.gz.csi" index file for per-base.bed.gz file
right config["reference"]["exon_bed"] design bed used to only look at coverage based on bedfile
output results "qc/mosdepth_bed/{sample}_{type}.mosdepth.per-base.exon_bed.txt" .txt file with coverage per base for design file
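Conceptually, the intersect step keeps only the parts of the per-base coverage intervals that fall inside the regions of the bed file. The pure-Python sketch below illustrates that idea on half-open BED intervals. It is not how bedtools is implemented, the function name `intersect` and the data are invented, and the real output also depends on whatever is passed through `extra`.

```python
def intersect(coverage, regions):
    """Clip each coverage interval to every region it overlaps."""
    hits = []
    for chrom, start, end, depth in coverage:
        for r_chrom, r_start, r_end in regions:
            if chrom != r_chrom:
                continue
            lo, hi = max(start, r_start), min(end, r_end)
            if lo < hi:  # non-empty overlap between half-open intervals
                hits.append((chrom, lo, hi, depth))
    return hits


# Made-up mosdepth-style records: chrom, start, end, depth.
coverage = [
    ("chr1", 0, 100, 0),
    ("chr1", 100, 250, 32),
    ("chr1", 250, 400, 12),
    ("chr2", 0, 50, 5),
]
# Made-up exon bed with a single region.
exons = [("chr1", 200, 300)]

hits = intersect(coverage, exons)
for record in hits:
    print(*record, sep="\t")
```

The nested loop is quadratic and only meant for reading; bedtools handles genome-scale files far more efficiently.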

🔧 Configuration

Software settings (config.yaml)

Key Type Description
container string path to container with bedtools (common)
extra string extra configuration for bedtools intersect

Resources settings (resources.yaml)

Key Type Description
threads integer number of threads that will be used by export_qc_bedtools_intersect
time string max execution time for export_qc_bedtools_intersect
mem_mb integer memory used for export_qc_bedtools_intersect
mem_per_cpu integer memory used per cpu for export_qc_bedtools_intersect
partition string partition to use on the cluster for export_qc_bedtools_intersect

🐍 Rule

rule export_qc_bedtools_intersect_pgrs:
    input:
        left="qc/mosdepth_bed/{sample}_{type}.per-base.bed.gz",
        coverage_csi="qc/mosdepth_bed/{sample}_{type}.per-base.bed.gz.csi",
        right=config["reference"]["pgrs_bed"],
    output:
        results=temp("qc/mosdepth_bed/{sample}_{type}.mosdepth.pgrs_cov.txt"),
    params:
        extra=config.get("export_qc_bedtools_intersect", {}).get("extra", ""),
    log:
        "qc/mosdepth_bed/{sample}_{type}.mosdepth.pgrs_cov.log",
    benchmark:
        repeat(
            "qc/mosdepth_bed/{sample}_{type}.mosdepth.pgrs_cov.benchmark.tsv",
            config.get("export_qc_bedtools_intersect", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("export_qc_bedtools_intersect", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("export_qc_bedtools_intersect", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("export_qc_bedtools_intersect", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("export_qc_bedtools_intersect", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("export_qc_bedtools_intersect", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("export_qc_bedtools_intersect", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("export_qc_bedtools_intersect", {}).get("container", config["default_container"])
    message:
        "{rule}: export low cov regions from {input.left} based on {input.right}"
    wrapper:
        "v1.32.0/bio/bedtools/intersect"

↔ input / output files

Rule parameters Key Value Description
input left "qc/mosdepth_bed/{sample}_{type}.per-base.bed.gz" per-base coverage file from mosdepth
coverage_csi "qc/mosdepth_bed/{sample}_{type}.per-base.bed.gz.csi" index file for per-base.bed.gz file
right config["reference"]["pgrs_bed"] design bed used to only look at coverage based on bedfile, in this case pgrs positions
output results "qc/mosdepth_bed/{sample}_{type}.mosdepth.pgrs_cov.txt" .txt file with coverage per base for design file

🔧 Configuration

Software settings (config.yaml)

Key Type Description
container string path to container with bedtools

Resources settings (resources.yaml)

Key Type Description
threads integer number of threads that will be used by export_qc_bedtools_intersect_pgrs
time string max execution time for export_qc_bedtools_intersect_pgrs
mem_mb integer memory used for export_qc_bedtools_intersect_pgrs
mem_per_cpu integer memory used per cpu for export_qc_bedtools_intersect_pgrs
partition string partition to use on the cluster for export_qc_bedtools_intersect_pgrs

🐍 Rule

rule export_qc_xlsx_report:
    input:
        mosdepth_summary="qc/mosdepth_bed/{sample}_{type}.mosdepth.summary.txt",
        mosdepth_thresholds="qc/mosdepth_bed/{sample}_{type}.thresholds.bed.gz",
        mosdepth_regions="qc/mosdepth_bed/{sample}_{type}.regions.bed.gz",
        mosdepth_perbase="qc/mosdepth_bed/{sample}_{type}.mosdepth.per-base.exon_bed.txt",
        picard_dup="qc/picard_collect_duplication_metrics/{sample}_{type}.duplication_metrics.txt",
        pgrs_coverage="qc/mosdepth_bed/{sample}_{type}.mosdepth.pgrs_cov.txt",
        design_bed=config["reference"]["design_bed"],
        pgrs_bed=config["reference"]["pgrs_bed"],
        wanted_transcripts=config["export_qc_xlsx_report"]["wanted_transcripts"],
    output:
        results=temp("qc/xlsx_report/{sample}_{type}.xlsx"),
    params:
        coverage_thresholds=config["mosdepth_bed"]["thresholds"],
        sequenceid=config["sequenceid"],
    log:
        "qc/xlsx_report/{sample}_{type}.xlsx.log",
    benchmark:
        repeat(
            "qc/xlsx_report/{sample}_{type}.xlsx.benchmark.tsv",
            config.get("export_qc_xlsx_report", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("export_qc_xlsx_report", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("export_qc_xlsx_report", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("export_qc_xlsx_report", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("export_qc_xlsx_report", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("export_qc_xlsx_report", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("export_qc_xlsx_report", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("export_qc_xlsx_report", {}).get("container", config["default_container"])
    message:
        "{rule}: collecting qc values into {output}"
    # localrule: True
    script:
        "../scripts/export_qc_xlsx_report.py"

↔ input / output files

Rule parameters Key Value Description
input mosdepth_summary "qc/mosdepth_bed/{sample}_{type}.mosdepth.summary.txt" mosdepth bed summary file
mosdepth_thresholds "qc/mosdepth_bed/{sample}_{type}.thresholds.bed.gz" Mosdepth bed thresholds file
mosdepth_regions "qc/mosdepth_bed/{sample}_{type}.regions.bed.gz" mosdepth bed coverage per region file
mosdepth_perbase "qc/mosdepth_bed/{sample}_{type}.mosdepth.per-base.exon_bed.txt" mosdepth per-base results restricted to exons (output of export_qc_bedtools_intersect)
picard_dup "qc/picard_collect_duplication_metrics/{sample}_{type}.duplication_metrics.txt" picard collect duplication metrics results file
pgrs_coverage "qc/mosdepth_bed/{sample}_{type}.mosdepth.pgrs_cov.txt" mosdepth per-base file from export_qc_bedtools_intersect_pgrs output
design_bed config["reference"]["design_bed"] design bed defined in config-file
pgrs_bed config["reference"]["pgrs_bed"] bedfile with PGRS score SNPs
wanted_transcripts config["export_qc_xlsx_report"]["wanted_transcripts"] path to txt-file in bedformat of transcripts of interest
output results "qc/xlsx_report/{sample}_{type}.xlsx" .xlsx file with summarized QC-values per sample

🔧 Configuration

Software settings (config.yaml)

Key Type Description
container string path to container with python3, gzip, date and xlsxwriter
wanted_transcripts string transcripts of interest to be highlighted in xlsx report

Resources settings (resources.yaml)

Key Type Description
threads integer number of threads that will be used by export_qc_xlsx_report
time string max execution time for export_qc_xlsx_report
mem_mb integer memory used for export_qc_xlsx_report
mem_per_cpu integer memory used per cpu for export_qc_xlsx_report
partition string partition to use on the cluster for export_qc_xlsx_report

sample_order_multiqc.smk

A Python script that creates the sample_replacement and sample_order files used by MultiQC to order samples by the "S" index in the sample names.

🐍 Rule

rule sample_order_multiqc:
    output:
        replacement=temp("qc/multiqc/sample_replacement.tsv"),
        order=temp("qc/multiqc/sample_order.tsv"),
    params:
        filelist=[(u.sample, u.fastq1) for u in units[units.type == "N"].itertuples()],
    log:
        "qc/multiqc/sample_order.tsv.log",
    benchmark:
        repeat("qc/multiqc/sample_order.tsv.benchmark.tsv", config.get("sample_order_multiqc", {}).get("benchmark_repeats", 1))
    threads: config.get("sample_order_multiqc", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("sample_order_multiqc", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("sample_order_multiqc", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("sample_order_multiqc", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("sample_order_multiqc", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("sample_order_multiqc", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("sample_order_multiqc", {}).get("container", config["default_container"])
    message:
        "{rule}: Create a sample order tsv based on S_index in {params.filelist} for multiqc"
    script:
        "../scripts/sample_order_multiqc.py"

↔ input / output files

Rule parameters Key Value Description
output replacement "qc/multiqc/sample_replacement.tsv" list of sample name replacement, sampleXXX based on order in SampleSheet
order "qc/multiqc/sample_order.tsv" list of back-translated name from sampleXXX to original names
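The two files can be thought of as a pair of lookup tables built from `params.filelist`. The sketch below is one possible way to derive them, not the contents of sample_order_multiqc.py: it assumes the "S" index appears in the fastq name as `_S<number>_` (the usual Illumina naming), and the placeholder format `sample001`, `sample002`, ... is a guess based on the sampleXXX wording above.

```python
import re


def s_index(fastq_path):
    """Extract the numeric S index from a fastq file name (assumed _S<n>_ pattern)."""
    match = re.search(r"_S(\d+)_", fastq_path)
    if match is None:
        raise ValueError(f"no S index found in {fastq_path}")
    return int(match.group(1))


def build_tables(filelist):
    """Return (replacement, order) rows sorted numerically by S index."""
    ordered = sorted(filelist, key=lambda item: s_index(item[1]))
    replacement, order = [], []
    for position, (sample, _fastq) in enumerate(ordered, start=1):
        placeholder = f"sample{position:03d}"
        replacement.append((sample, placeholder))  # original -> placeholder
        order.append((placeholder, sample))        # placeholder -> original
    return replacement, order


# Hypothetical (sample, fastq1) tuples, shaped like params.filelist.
filelist = [
    ("NA12878", "fastq/NA12878_S12_L001_R1_001.fastq.gz"),
    ("HG002", "fastq/HG002_S3_L001_R1_001.fastq.gz"),
]
replacement, order = build_tables(filelist)
print(replacement)
print(order)
```

Sorting on the integer value matters here: compared as strings, "12" would sort before "3".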

🔧 Configuration

Software settings (config.yaml)

Key Type Description
container string path to container

Resources settings (resources.yaml)

Key Type Description
threads integer number of threads that will be used by sample_order_multiqc
time string max execution time for sample_order_multiqc
mem_mb integer memory used for sample_order_multiqc
mem_per_cpu integer memory used per cpu for sample_order_multiqc
partition string partition to use on the cluster for sample_order_multiqc

tsv2vcf

Converts exomedepth calls in tsv format to VCF.

🐍 Rule

rule tsv2vcf:
    input:
        tsv="cnv_sv/exomedepth_call/{sample}_{type}.txt",
        ref=config["reference"]["fasta"],
    output:
        vcf="cnv_sv/exomedepth_call/{sample}_{type}.vcf",
    params:
        extra=config.get("tsv2vcf", {}).get("extra", ""),
    log:
        "cnv_sv/exomedepth_call/{sample}_{type}.vcf.gz.log",
    benchmark:
        repeat(
            "cnv_sv/exomedepth_call/{sample}_{type}.vcf.gz.benchmark.tsv", config.get("tsv2vcf", {}).get("benchmark_repeats", 1)
        )
    threads: config.get("tsv2vcf", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("tsv2vcf", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("tsv2vcf", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("tsv2vcf", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("tsv2vcf", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("tsv2vcf", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("tsv2vcf", {}).get("container", config["default_container"])
    message:
        "{rule}: convert {input.tsv} to VCF"
    script:
        "../scripts/tsv2vcf.sh"

↔ input / output files

Rule parameters Key Value Description
input tsv "cnv_sv/exomedepth_call/{sample}_{type}.txt" Exomedepth calls in tsv format
ref config["reference"]["fasta"] reference genome fasta file
output vcf "cnv_sv/exomedepth_call/{sample}_{type}.vcf" Exomedepth calls in VCF format

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time