FLAMES: FLAMES: Full Length Analysis of Mutations and Splicing in long read RNA-seq data

Package ‘FLAMES’

August 26, 2024

Title FLAMES: Full Length Analysis of Mutations and Splicing in long

read RNA-seq data

Version 1.11.1

Date 2023-03-27

Description Semi-supervised isoform detection and annotation from both bulk and single-cell

long read RNA-seq data. Flames provides automated pipelines for analysing isoforms,

as well as intermediate functions for manual execution.

biocViews RNASeq, SingleCell, Transcriptomics, DataImport,

DifferentialSplicing, AlternativeSplicing, GeneExpression,

LongRead

BugReports https://github.com/mritchielab/FLAMES/issues

License GPL (>= 3)

Encoding UTF-8

Imports basilisk, bambu, Biostrings, BiocGenerics, circlize,

ComplexHeatmap, cowplot, dplyr, DropletUtils, GenomicRanges,

GenomicFeatures, txdbmaker, GenomicAlignments, GenomeInfoDb,

ggplot2, ggbio, grid, gridExtra, igraph, jsonlite, magrittr,

Matrix, parallel, reticulate, Rsamtools, rtracklayer,

RColorBrewer, SingleCellExperiment, SummarizedExperiment,

scater, S4Vectors, scuttle, stats, scran, stringr,

MultiAssayExperiment, tidyr, utils, withr, future, methods,

tibble, tidyselect, IRanges

Suggests BiocStyle, GEOquery, knitr, rmarkdown, markdown,

BiocFileCache, R.utils, ShortRead, uwot, testthat (>= 3.0.0),

xml2

LinkingTo Rcpp, Rhtslib, testthat

SystemRequirements GNU make, C++17, samtools (>= 1.19), minimap2 (>=

2.17)

RoxygenNote 7.3.2

VignetteBuilder knitr

URL https://github.com/OliverVoogd/FLAMES

2 Contents

Conﬁg/testthat/edition 3

Depends R (>= 4.1.0)

LazyData true

StagedInstall no

git_url https://git.bioconductor.org/packages/FLAMES

git_branch devel

git_last_commit 2884c96

git_last_commit_date 2024-08-20

Repository Bioconductor 3.20

Date/Publication 2024-08-26

Author Luyi Tian [aut],

Changqing Wang [aut, cre],

Yupei You [aut],

Oliver Voogd [aut],

Jakob Schuster [aut],

Shian Su [aut],

Matthew Ritchie [ctb]

Maintainer Changqing Wang <[email protected]>

Contents

annotation_to_fasta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

blaze . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

bulk_long_pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

combine_sce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

create_conﬁg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

create_sce_from_dir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

create_se_from_dir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

cutadapt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

demultiplex_sockeye . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

ﬁlter_annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

ﬁnd_barcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

ﬁnd_isoform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

ﬁnd_variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

FLAMES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

ﬂexiplex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

get_GRangesList . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

minimap2_align . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

minimap2_realign . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

parse_gff_tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

plot_coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

plot_demultiplex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

quantify_gene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

quantify_transcript . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

annotation_to_fasta 3

scmixology_lib10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

scmixology_lib10_transcripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

scmixology_lib90 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

sc_DTU_analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

sc_heatmap_expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

sc_long_multisample_pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

sc_long_pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

sc_mutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

sc_reduce_dims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

sc_umap_expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

sys_which . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Index 42

annotation_to_fasta GTF/GFF to FASTA conversion

Description

convert the transcript annotation to transcriptome assembly as FASTA ﬁle. The genome annotation

is ﬁrst imported as TxDb object and then used to extract transcript sequence from the genome

assembly.

Usage

annotation_to_fasta(isoform_annotation, genome_fa, outdir, extract_fn)

Arguments

isoform_annotation

Path to the annotation ﬁle (GTF/GFF3)

genome_fa The ﬁle path to genome fasta ﬁle.

outdir The path to directory to store the transcriptome as transcript_assembly.fa.

extract_fn (optional) Function to extract GRangesList from the genome TxDb object. E.g.

function(txdb){GenomicFeatures::cdsBy(txdb, by="tx", use.names=TRUE)}

Value

Path to the outputted transcriptome assembly

Examples

fasta <- annotation_to_fasta(system.file("extdata/rps24.gtf.gz", package = "FLAMES"), system.file("extdata/rps24.fa.gz", package = "FLAMES"), tempdir())

cat(readChar(fasta, nchars = 1e3))

4 blaze

blaze BLAZE Assign reads to cell barcodes.

Description

Uses BLAZE to generate barcode list and assign reads to cell barcodes.

Usage

blaze(expect_cells, fq_in, ...)

Arguments

expect_cells Integer, expected number of cells. Note: this could be just a rough estimate.

E.g., the targeted number of cells.

fq_in File path to the fastq ﬁle used as a query sequence ﬁle

... Additional BLAZE conﬁguration parameters. E.g., setting ‘’output-preﬁx’=’some_preﬁx’‘

is equivalent to specifying ‘–output-preﬁx some_preﬁx‘ in BLAZE; Similarly,

‘overwrite=TRUE‘ is equivalent to switch on the ‘–overwrite‘ option. Note that

the speciﬁed parameters will override the parameters speciﬁed in the conﬁgura-

tion ﬁle. All available options can be found at https://github.com/shimlab/BLAZE.

Value

A data.frame summarising the reads aligned. Other outputs are written to disk. The details of the

output ﬁles can be found at https://github.com/shimlab/BLAZE.

Examples

temp_path <- tempfile()

bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE)

fastq1_url <- 'https://raw.githubusercontent.com/shimlab/BLAZE/main/test/data/FAR20033_pass_51e510db_100.fastq'

fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', fastq1_url))]]

outdir <- tempfile()

dir.create(outdir)

## Not run:

blaze(expect_cells=10, fastq1, overwrite=TRUE)

## End(Not run)

bulk_long_pipeline 5

bulk_long_pipeline Pipeline for Bulk Data

Description

Semi-supervised isofrom detection and annotation for long read data. This variant is meant for bulk

samples. Speciﬁc parameters relating to analysis can be changed either through function arguments,

or through a conﬁguration JSON ﬁle.

Usage

bulk_long_pipeline(

annotation,

fastq,

outdir,

genome_fa,

minimap2 = NULL,

k8 = NULL,

config_file = NULL

)

Arguments

annotation The ﬁle path to the annotation ﬁle in GFF3 format

fastq The ﬁle path to input fastq ﬁle

outdir The path to directory to store all output ﬁles.

genome_fa The ﬁle path to genome fasta ﬁle.

minimap2 Path to minimap2, if it is not in PATH. Only required if either or both of do_genome_align

and do_read_realign are TRUE.

k8 Path to the k8 Javascript shell binary. Only required if do_genome_align is

TRUE.

config_file File path to the JSON conﬁguration ﬁle. If speciﬁed, config_file overrides all

conﬁguration parameters

Details

By default FLAMES use minimap2 for read alignment. After the genome alignment step (do_genome_align),

FLAMES summarizes the alignment for each read by grouping reads with similar splice junctions

to get a raw isoform annotation (do_isoform_id). The raw isoform annotation is compared against

the reference annotation to correct potential splice site and transcript start/end errors. Transcripts

that have similar splice junctions and transcript start/end to the reference transcript are merged

with the reference. This process will also collapse isoforms that are likely to be truncated tran-

scripts. If isoform_id_bambu is set to TRUE, bambu::bambu will be used to generate the up-

dated annotations. Next is the read realignment step (do_read_realign), where the sequence of

each transcript from the update annotation is extracted, and the reads are realigned to this updated

6 bulk_long_pipeline

transcript_assembly.fa by minimap2. The transcripts with only a few full-length aligned reads

are discarded. The reads are assigned to transcripts based on both alignment score, fractions of

reads aligned and transcript coverage. Reads that cannot be uniquely assigned to transcripts or have

low transcript coverage are discarded. The UMI transcript count matrix is generated by collapsing

the reads with the same UMI in a similar way to what is done for short-read scRNA-seq data, but

allowing for an edit distance of up to 2 by default. Most of the parameters, such as the minimal

distance to splice site and minimal percentage of transcript coverage can be modiﬁed by the JSON

conﬁguration ﬁle (config_file).

The default parameters can be changed either through the function arguments are through the con-

ﬁguration JSON ﬁle config_file. the pipeline_parameters section speciﬁes which steps are to

be executed in the pipeline - by default, all steps are executed. The isoform_parameters section

affects isoform detection - key parameters include:

Min_sup_cnt which causes transcripts with less reads aligned than it’s value to be discarded

MAX_TS_DIST which merges transcripts with the same intron chain and TSS/TES distace less than

MAX_TS_DIST

strand_specific which speciﬁes if reads are in the same strand as the mRNA (1), or the reverse

complemented (-1) or not strand speciﬁc (0), which results in strand information being based

on reference annotation.

Value

if do_transcript_quantification set to true, bulk_long_pipeline returns a SummarizedEx-

periment object, containing a count matrix as an assay, gene annotations under metadata, as well

as a list of the other output ﬁles generated by the pipeline. The pipeline also outputs a number of

output ﬁles into the given outdir directory. These output ﬁles generated by the pipeline are:

transcript_count.csv.gz - a transcript count matrix (also contained in the SummarizedExperiment)

isoform_annotated.ﬁltered.gff3 - isoforms in gff3 format (also contained in the SummarizedEx-

periment)

transcript_assembly.fa - transcript sequence from the isoforms

align2genome.bam - sorted BAM ﬁle with reads aligned to genome

realign2transcript.bam - sorted realigned BAM ﬁle using the transcript_assembly.fa as reference

tss_tes.bedgraph - TSS TES enrichment for all reads (for QC)

if do_transcript_quantification set to false, nothing will be returned

See Also

sc_long_pipeline() for single cell data, SummarizedExperiment() for how data is outputted

Examples

# download the two fastq files, move them to a folder to be merged together

temp_path <- tempfile()

bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE)

file_url <-

"https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data"

combine_sce 7

# download the required fastq files, and move them to new folder

fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]]

fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep = "/")))]]

annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]]

genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]]

fastq_dir <- paste(temp_path, "fastq_dir", sep = "/") # the downloaded fastq files need to be in a directory to be merged together

dir.create(fastq_dir)

file.copy(c(fastq1, fastq2), fastq_dir)

unlink(c(fastq1, fastq2)) # the original files can be deleted

outdir <- tempfile()

dir.create(outdir)

if (!any(is.na(sys_which(c("minimap2", "k8"))))) {

se <- bulk_long_pipeline(

annotation = annotation, fastq = fastq_dir, outdir = outdir, genome_fa = genome_fa,

config_file = system.file("extdata/SIRV_config_default.json", package = "FLAMES")

)

se_2 <- create_se_from_dir(outdir = outdir, annotation = annotation)

}

combine_sce Combine SCE

Description

Combine long- and short-read SingleCellExperiment objects

Usage

combine_sce(

short_read_large,

short_read_small,

long_read_sce,

remove_duplicates = FALSE

)

Arguments

short_read_large

The SCE object, or path to the HDF5 ﬁle, or folder containing the matrix ﬁle,

corresponding to the larger short-read sample

short_read_small

The SCE object, or path to the HDF5 ﬁle, or folder containing the matrix ﬁle,

corresponding to the smaller short-read sample

long_read_sce The SCE object of the transcript counts, from the long-read pipelines.

remove_duplicates

determines whether cells with duplicated barcodes aer kept in the smaller library

( they are always removed from the larger library)

8 create_conﬁg

Details

Takes the long-read SCE object from the long-read pipeline and the short-read SCE object, cre-

ates a MultiAssayExperiment object with the two SingleCellExperiment objects. Cells with

duplicated barcodes are removed from the larger library.

Value

A MultiAssayExperiment object, with ’gene_counts’ and ’transcript_counts’ experiments.

Examples

library(SingleCellExperiment)

a <- SingleCellExperiment(assays = list(counts = matrix(rpois(100, 5), ncol = 10)))

b <- SingleCellExperiment(assays = list(counts = matrix(rpois(100, 5), ncol = 10)))

long_read <- SingleCellExperiment(assays = list(counts = matrix(rpois(100, 5), ncol = 10)))

colData(a)$Barcode <- paste0(1:10, '-1')

colData(b)$Barcode <- paste0(8:17, '-1')

colnames(long_read) <- as.character(2:11)

rownames(a) <- as.character(101:110)

rownames(b) <- as.character(103:112)

rownames(long_read) <- as.character(1001:1010)

combine_sce(short_read_large = a, short_read_small = b, long_read_sce = long_read)

create_config Create Conﬁguration File From Arguments

Description

Create Conﬁguration File From Arguments

Usage

create_config(outdir, type = "sc_3end", ...)

Arguments

outdir the destination directory for the conﬁguratio nﬁle

type use an example conﬁg, available values:

"sc_3end" - conﬁg for 10x 3’ end ONT reads

"SIRV" - conﬁg for the SIRV example reads

... Conﬁguration parameters.

seed - Integer. Seed for minimap2.

threads - Number of threads to use.

do_barcode_demultiplex - Boolean. Speciﬁes whether to run the barcode de-

multiplexing step.

create_conﬁg 9

do_genome_alignment - Boolean. Speciﬁes whether to run the genome align-

ment step. TRUE is recommended

do_gene_quantiﬁcation - Boolean. Speciﬁes whether to run gene quantiﬁca-

tion using the genome alignment results. TRUE is recommended

do_isoform_identiﬁcation - Boolean. Speciﬁes whether to run the isoform

identiﬁcation step. TRUE is recommended

bambu_isoform_identiﬁcation - Boolean. Whether to use Bambu for isoform

identiﬁcation.

multithread_isoform_identiﬁcation - Boolean. Whether to use FLAMES’

new multithreaded Cpp implementation for isoform identiﬁcation.

do_read_realignment - Boolean. Speciﬁes whether to run the read realign-

ment step. TRUE is recommended

do_transcript_quantiﬁcation - Boolean. Speciﬁes whether to run the tran-

script quantiﬁcation step. TRUE is recommended

barcode_parameters - List. Parameters for barcode demultiplexing passed to

find_barcode (except fastq, barcodes_file, stats_out, reads_out)

and threads, which are set by the pipeline, see ?find_barcode for more

details.

generate_raw_isoform - Boolean. Whether to generate all isoforms for debug-

ging purpose.

max_dist - Maximum distance allowed when merging splicing sites in isoform

consensus clustering.

max_ts_dist - Maximum distance allowed when merging transcript start/end

position in isoform consensus clustering.

max_splice_match_dist - Maximum distance allowed when merging splice

site called from the data and the reference annotation.

min_ﬂ_exon_len - Minimum length for the ﬁrst exon outside the gene body in

reference annotation. This is to correct the alignment artifact

max_site_per_splice - Maximum transcript start/end site combinations allowed

per splice chain

min_sup_cnt - Minimum number of read support an isoform decrease this

number will signiﬁcantly increase the number of isoform detected.

min_cnt_pct - Minimum percentage of count for an isoform relative to total

count for the same gene.

min_sup_pct - Minimum percentage of count for an splice chain that support

a given transcript start/end site combination.

strand_speciﬁc - 0, 1 or -1. 1 indicates if reads are in the same strand as mRNA,

-1 indicates reads are reverse complemented, 0 indicates reads are not strand

speciﬁc.

remove_incomp_reads - The strenge of truncated isoform ﬁltering. larger

number means more stringent ﬁltering.

use_junctions - whether to use known splice junctions to help correct the align-

ment results

no_ﬂank - Boolean. for synthetic spike-in data. refer to Minimap2 document

for detail

10 create_sce_from_dir

use_annotation - Boolean. whether to use reference to help annotate known

isoforms

min_tr_coverage - Minimum percentage of isoform coverage for a read to be

aligned to that isoform

min_read_coverage - Minimum percentage of read coverage for a read to be

uniquely aligned to that isoform

Details

Create a list object containing the arguments supplied in a format usable for the FLAMES pipeline.

Also writes the object to a JSON ﬁle, which is located with the preﬁx ’conﬁg_’ in the supplied

outdir. Default values from extdata/config_sclr_nanopore_3end.json will be used for un-

provided parameters.

Value

ﬁle path to the conﬁg ﬁle created

Examples

# create the default configuration file

outdir <- tempdir()

config <- create_config(outdir)

create_sce_from_dir Create SingleCellExperiment object from FLAMES output folder

Description

Create SingleCellExperiment object from FLAMES output folder

Usage

create_sce_from_dir(outdir, annotation)

Arguments

outdir The folder containing FLAMES output ﬁles

annotation (Optional) the annotation ﬁle that was used to produce the output ﬁles

Value

a list of SingleCellExperiment objects if multiple transcript matrices were found in the output

folder, or a SingleCellExperiment object if only one were found

create_se_from_dir 11

Examples

outdir <- tempfile()

dir.create(outdir)

bc_allow <- file.path(outdir, "bc_allow.tsv")

genome_fa <- file.path(outdir, "rps24.fa")

R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE)

R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE)

annotation <- system.file("extdata/rps24.gtf.gz", package = "FLAMES")

if (!any(is.na(sys_which(c("minimap2", "k8"))))) {

sce <- FLAMES::sc_long_pipeline(

genome_fa = genome_fa,

fastq = system.file("extdata/fastq", package = "FLAMES"),

annotation = annotation,

outdir = outdir,

barcodes_file = bc_allow

)

sce_2 <- create_sce_from_dir(outdir, annotation)

}

create_se_from_dir Create SummarizedExperiment object from FLAMES output folder

Description

Create SummarizedExperiment object from FLAMES output folder

Usage

create_se_from_dir(outdir, annotation)

Arguments

outdir The folder containing FLAMES output ﬁles

annotation (Optional) the annotation ﬁle that was used to produce the output ﬁles

Value

a SummarizedExperiment object

Examples

# download the two fastq files, move them to a folder to be merged together

temp_path <- tempfile()

bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE)

file_url <-

"https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data"

# download the required fastq files, and move them to new folder

fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]]

12 cutadapt

fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep = "/")))]]

annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]]

genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]]

fastq_dir <- paste(temp_path, "fastq_dir", sep = "/") # the downloaded fastq files need to be in a directory to be merged together

dir.create(fastq_dir)

file.copy(c(fastq1, fastq2), fastq_dir)

unlink(c(fastq1, fastq2)) # the original files can be deleted

outdir <- tempfile()

dir.create(outdir)

if (!any(is.na(sys_which(c("minimap2", "k8"))))) {

se <- bulk_long_pipeline(

annotation = annotation, fastq = fastq_dir, outdir = outdir, genome_fa = genome_fa,

config_file = system.file("extdata/SIRV_config_default.json", package = "FLAMES")

)

se_2 <- create_se_from_dir(outdir = outdir, annotation = annotation)

}

cutadapt cutadapt wrapper

Description

trim TSO adaptor with cutadapt

Usage

cutadapt(args)

Arguments

args arguments to be passed to cutadapt

Value

Exit code of cutadapt

Examples

## Not run:

cutadapt("-h")

## End(Not run)

demultiplex_sockeye 13

demultiplex_sockeye Demultiplex reads using Sockeye outputs

Description

Demultiplex reads using the cell_umi_gene.tsv ﬁle from Sockeye.

Usage

demultiplex_sockeye(fastq_dir, sockeye_tsv, out_fq)

Arguments

fastq_dir The folder containing FASTQ ﬁles from Sockeye’s output under ingest/chunked_fastqs.

sockeye_tsv The cell_umi_gene.tsv ﬁle from Sockeye.

out_fq The output FASTQ ﬁle.

Value

returns NULL

filter_annotation ﬁlter annotation for plotting coverages

Description

Removes isoform annotations that could produce ambigious reads, such as isoforms that only differ

by the 5’ / 3’ end. This could be useful for plotting average coverage plots.

Usage

filter_annotation(annotation, keep = "tss_differ")

Arguments

annotation path to the GTF annotation ﬁle, or the parsed GenomicRanges object.

keep string, one of ’tss_differ’ (only keep isoforms that all differ by the transcription

start site position), ’tes_differ’ (only keep those that differ by the transcription

end site position), ’both’ (only keep those that differ by both the start and end

site), or ’single_transcripts’ (only keep genes that contains a sinlge transcript).

Value

GenomicRanges of the ﬁltered isoforms

14 ﬁnd_barcode

Examples

filtered_annotation <- filter_annotation(system.file('extdata/rps24.gtf.gz', package = 'FLAMES'), keep = 'tes_differ')

filtered_annotation

find_barcode Match Cell Barcodes

Description

demultiplex reads with ﬂexiplex

Usage

find_barcode(

fastq,

barcodes_file,

max_bc_editdistance = 2,

max_flank_editdistance = 8,

reads_out,

stats_out,

threads = 1,

pattern = c(primer = "CTACACGACGCTCTTCCGATCT", BC = paste0(rep("N", 16), collapse =

""), UMI = paste0(rep("N", 12), collapse = ""), polyT = paste0(rep("T", 9), collapse

= "")),

TSO_seq = "",

TSO_prime = 3,

full_length_only = FALSE

)

Arguments

fastq input FASTQ ﬁle path

barcodes_file path to ﬁle containing barcode allow-list, with one barcode in each line

max_bc_editdistance

max edit distances for the barcode sequence

max_flank_editdistance

max edit distances for the ﬂanking sequences (primer and polyT)

reads_out path of output FASTQ ﬁle

stats_out path of output stats ﬁle

threads number of threads to be used

pattern named character vector deﬁning the barcode pattern

TSO_seq TSO sequence to be trimmed

TSO_prime either 3 (when TSO_seq is on 3’ the end) or 5 (on 5’ end)

full_length_only

boolean, when TSO sequence is provided, whether reads without TSO are to be

discarded

ﬁnd_isoform 15

Value

invisible()

Examples

outdir <- tempfile()

dir.create(outdir)

bc_allow <- file.path(outdir, "bc_allow.tsv")

R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE)

find_barcode(

fastq = system.file("extdata/fastq", package = "FLAMES"),

stats_out = file.path(outdir, "bc_stat"),

reads_out = file.path(outdir, "demultiplexed.fq.gz"),

barcodes_file = bc_allow

)

find_isoform Isoform identiﬁcation

Description

Long-read isoform identiﬁcation with FLAMES or bambu.

Usage

find_isoform(annotation, genome_fa, genome_bam, outdir, config)

Arguments

annotation Path to annotation ﬁle. If conﬁgured to use bambu, the annotation must be

provided as GTF ﬁle.

genome_fa The ﬁle path to genome fasta ﬁle.

genome_bam File path to BAM alignment ﬁle. Multiple ﬁles could be provided.

outdir The path to directory to store all output ﬁles.

config Parsed FLAMES conﬁgurations.

Value

The updated annotation and the transcriptome assembly will be saved in the output folder as isoform_annotated.gff3

(GTF if bambu is selected) and transcript_assembly.fa respectively.

16 ﬁnd_variants

Examples

temp_path <- tempfile()

bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE)

file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data"

fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]]

genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]]

annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]]

outdir <- tempfile()

dir.create(outdir)

if (!any(is.na(sys_which(c("minimap2", "k8"))))) {

config <- jsonlite::fromJSON(system.file("extdata/SIRV_config_default.json", package = "FLAMES"))

minimap2_align(

config = config,

fa_file = genome_fa,

fq_in = fastq1,

annot = annotation,

outdir = outdir

)

## Not run:

find_isoform(

annotation = annotation, genome_fa = genome_fa,

genome_bam = file.path(outdir, "align2genome.bam"),

outdir = outdir, config = config

)

## End(Not run)

}

find_variants bulk variant identiﬁcation

Description

Treat each bam ﬁle as a bulk sample and identify variants against the reference

Usage

find_variants(

bam_path,

reference,

annotation,

min_nucleotide_depth = 100,

homopolymer_window = 3,

annotated_region_only = FALSE,

names_from = "gene_name",

threads = 1

)

ﬁnd_variants 17

Arguments

bam_path character(1) or character(n): path to the bam ﬁle(s) aligned to the reference

genome (NOT the transcriptome!).

reference DNAStringSet: the reference genome

annotation GRanges: the annotation of the reference genome. You can load a GTF/GFF

annotation ﬁle with anno <- rtracklayer::import(file).

min_nucleotide_depth

integer(1): minimum read depth for a position to be considered a variant.

homopolymer_window

integer(1): the window size to calculate the homopolymer percentage. The ho-

mopolymer percentage is calculated as the percentage of the most frequent nu-

cleotide in a window of -homopolymer_window to homopolymer_window nu-

cleotides around the variant position, excluding the variant position itself. Cal-

culation of the homopolymer percentage is skipped when homopolymer_window

= 0. This is useful for ﬁltering out Nanopore sequencing errors in homopolymer

regions.

annotated_region_only

logical(1): whether to only consider variants outside annotated regions. If TRUE,

only variants outside annotated regions will be returned. If FALSE, all variants

will be returned, which could take signiﬁcantly longer time.

names_from character(1): the column name in the metadata column of the annotation (mcols(annotation)[,

names_from]) to use for the region column in the output.

threads integer(1): number of threads to use. Threading is done over each annotated re-

gion and (if annotated_region_only = FALSE) unannotated gaps for each bam

ﬁle.

Details

Each bam ﬁle is treated as a bulk sample to perform pileup and identify variants. You can run

sc_mutations with the variants identiﬁed with this function to get single-cell allele counts. Note

that reference genome FASTA ﬁles may have the chromosome names ﬁeld as ‘>chr1 1‘ instead

of ‘>chr1‘. You may need to remove the trailing number to match the chromosome names in

the bam ﬁle, for example with names(ref) <- sapply(names(ref), function(x) strsplit(x,

" ")[[1]][1]).

Value

A tibble with columns: seqnames, pos, nucleotide, count, sum, freq, ref, region, homopolymer_pct,

bam_path The homopolymer percentage is calculated as the percentage of the most frequent nu-

cleotide in a window of homopolymer_window nucleotides around the variant position, excluding

the variant position itself.

Examples

outdir <- tempfile()

dir.create(outdir)

genome_fa <- file.path(outdir, "rps24.fa")

18 ﬂexiplex

R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE)

download.file("https://raw.githubusercontent.com/mritchielab/FLAMES/devel/tests/testthat/demultiplexed.fq",

destfile = file.path(outdir, "demultipelxed.fq")

) # can't be bothered to run demultiplexing again

if (!any(is.na(sys_which(c("minimap2", "k8"))))) {

minimap2_align( # align to genome

config = jsonlite::fromJSON(system.file("extdata/SIRV_config_default.json", package = "FLAMES")),

fa_file = genome_fa,

fq_in = file.path(outdir, "demultipelxed.fq"),

annot = system.file("extdata/rps24.gtf.gz", package = "FLAMES"),

outdir = outdir

)

variants <- find_variants(

bam_path = file.path(outdir, "align2genome.bam"),

reference = genome_fa,

annotation = GenomicRanges::GRanges("chr14", IRanges::IRanges(1, 1)),

min_nucleotide_depth = 10

)

head(variants)

}

FLAMES FLAMES: full-length analysis of mutations and splicing

Description

FLAMES: full-length analysis of mutations and splicing

flexiplex Rcpp port of ﬂexiplex

Description

demultiplex reads with ﬂexiplex, for detailed description, see documentation for the original ﬂexi-

plex: https://davidsongroup.github.io/ﬂexiplex

Usage

flexiplex(

reads_in,

barcodes_file,

bc_as_readid,

max_bc_editdistance,

max_flank_editdistance,

pattern,

reads_out,

stats_out,

get_GRangesList 19

bc_out,

n_threads

)

Arguments

reads_in Input FASTQ or FASTA ﬁle

barcodes_file barcode allow-list ﬁle

bc_as_readid bool, whether to add the demultiplexed barcode to the read ID ﬁeld

max_bc_editdistance

max edit distance for barcode ’

max_flank_editdistance

max edit distance for the ﬂanking sequences ’

pattern StringVector deﬁning the barcode structure, see [ﬁnd_barcode]

reads_out output ﬁle for demultiplexed reads

stats_out output ﬁle for demultiplexed stats

bc_out WIP

n_threads number of threads to be used during demultiplexing

Value

integer return value. 0 represents normal return.

get_GRangesList Parse FLAMES’ GFF output

Description

Parse FLAMES’ GFF ouputs into a Genomic Ranges List

Usage

get_GRangesList(file)

Arguments

file the GFF ﬁle to parse

Value

A Genomic Ranges List

20 minimap2_align

Examples

temp_path <- tempfile()

bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE)

file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data"

fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]]

genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]]

annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]]

outdir <- tempfile()

dir.create(outdir)

if (!any(is.na(sys_which(c("minimap2", "k8"))))) {

config <- jsonlite::fromJSON(system.file("extdata/SIRV_config_default.json", package = "FLAMES"))

minimap2_align(

config = config,

fa_file = genome_fa,

fq_in = fastq1,

annot = annotation,

outdir = outdir

)

find_isoform(

annotation = annotation, genome_fa = genome_fa,

genome_bam = file.path(outdir, "align2genome.bam"),

outdir = outdir, config = config

)

grlist <- get_GRangesList(file = file.path(outdir, "isoform_annotated.gff3"))

}

minimap2_align Minimap2 Align to Genome

Description

Uses minimap2 to align sequences agains a reference databse. Uses options ’-ax splice -t 12 -k14

–secondary=no fa_file fq_in’

Usage

minimap2_align(

config,

fa_file,

fq_in,

annot,

outdir,

minimap2 = NA,

k8 = NA,

samtools = NA,

prefix = NULL,

threads = 1

)

minimap2_align 21

Arguments

config Parsed list of FLAMES conﬁg ﬁle

fa_file Path to the fasta ﬁle used as a reference database for alignment

fq_in File path to the fastq ﬁle used as a query sequence ﬁle

annot Genome annotation ﬁle used to create junction bed ﬁles

outdir Output folder

minimap2 Path to minimap2 binary

k8 Path to the k8 Javascript shell binary

samtools path to the samtools binary, required for large datasets since Rsamtools does

not support CSI indexing

prefix String, the preﬁx (e.g. sample name) for the outputted BAM ﬁle

threads Integer, threads for minimap2 to use, see minimap2 documentation for details,

FLAMES will try to detect cores if this parameter is not provided.

Value

a data.frame summarising the reads aligned

See Also

[minimap2_realign()]

Examples

temp_path <- tempfile()

bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE)

file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data'

fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]]

genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]]

annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]]

outdir <- tempfile()

dir.create(outdir)

if (!any(is.na(sys_which(c("minimap2", "k8"))))) {

minimap2_align(

config = jsonlite::fromJSON(system.file('extdata/SIRV_config_default.json', package = 'FLAMES')),

fa_file = genome_fa,

fq_in = fastq1,

annot = annotation,

outdir = outdir

)

}

22 minimap2_realign

minimap2_realign Minimap2 re-align reads to transcriptome

Description

Uses minimap2 to re-align reads to transcriptome

Usage

minimap2_realign(

config,

fq_in,

outdir,

minimap2,

samtools = NULL,

prefix = NULL,

threads = 1

)

Arguments

config Parsed list of FLAMES conﬁg ﬁle

fq_in File path to the fastq ﬁle used as a query sequence ﬁle

outdir Output folder

minimap2 Path to minimap2 binary

samtools path to the samtools binary, required for large datasets since Rsamtools does

not support CSI indexing

prefix String, the preﬁx (e.g. sample name) for the outputted BAM ﬁle

threads Integer, threads for minimap2 to use, see minimap2 documentation for details,

FLAMES will try to detect cores if this parameter is not provided.

Value

a data.frame summarising the reads aligned

See Also

[minimap2_align()]

Examples

temp_path <- tempfile()

bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE)

file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data'

fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]]

genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]]

parse_gff_tree 23

annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]]

outdir <- tempfile()

dir.create(outdir)

if (!any(is.na(sys_which(c("minimap2", "k8"))))) {

fasta <- annotation_to_fasta(annotation, genome_fa, outdir)

minimap2_realign(

config = jsonlite::fromJSON(system.file('extdata/SIRV_config_default.json', package = 'FLAMES')),

fq_in = fastq1,

outdir = outdir

)

}

parse_gff_tree Parse Gff3 ﬁle

Description

Parse a Gff3 ﬁle into 3 components: chromasome to gene name, a transcript dictionary, a gene to

transcript dictionary and a transcript to exon dictionary. These components are returned in a named

list.

Usage

parse_gff_tree(gff_file)

Arguments

gff_file the ﬁle path to the gff3 ﬁle to parse

Value

a named list with the elements "chr_to_gene", "transcript_dict", "gene_to_transcript", "transcript_to_exon",

containing the data parsed from the gff3 ﬁle.

Examples

temp_path <- tempfile()

bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE)

file_url <-

"https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data"

gff <- bfc[[names(BiocFileCache::bfcadd(bfc, "GFF", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]]

## Not run: parsed_gff <- parse_gff_tree(gff)

24 plot_coverage

plot_coverage plot read coverages

Description

Plot the average read coverages for each length bin or a perticular isoform

Usage

plot_coverage(

bam,

isoform = NULL,

length_bins = c(0, 1, 2, 5, 10, Inf),

weight_fn = "read_counts"

)

Arguments

bam path to the BAM ﬁle (aligning reads to the transcriptome), or the (Genomi-

cAlignments::readGAlignments) parsed GAlignments object

isoform string vector, provide isoform names to plot the coverage for the corresponding

isoforms, or provide NULL to plot average coverages for each length bin

length_bins numeric vector to specify the sizes to bin the isoforms by

weight_fn "read_counts" or "sigmoid", determins how the transcripts should be weighted

within length bins.

Value

a ggplot2 object of the coverage plot(s)

Examples

temp_path <- tempfile()

bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE)

file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data'

fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]]

genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]]

annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]]

outdir <- tempfile()

dir.create(outdir)

if (!any(is.na(sys_which(c("minimap2", "k8"))))) {

fasta <- annotation_to_fasta(annotation, genome_fa, outdir)

minimap2_realign(

config = jsonlite::fromJSON(system.file('extdata/SIRV_config_default.json', package = 'FLAMES')),

fq_in = fastq1,

outdir = outdir

)

plot_coverage(bam = file.path(outdir, 'realign2transcript.bam'))

}

plot_demultiplex 25

plot_demultiplex Plot Cell Barcode demultiplex statistics

Description

produce a barplot of cell barcode demultiplex statistics

Usage

plot_demultiplex(outdir, stats_file)

Arguments

outdir folder containing the matched_barcode_stat ﬁle, or matched_barcode_stat.SAMPLE

ﬁles. Ignored if stats_file is provided.

stats_file matched_barcode_stat ﬁle(s) from which the statistics to be plotted.

Value

a ggplot object of the barcode plot

Examples

outdir <- tempfile()

dir.create(outdir)

bc_allow <- file.path(outdir, "bc_allow.tsv")

R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE)

find_barcode(

fastq = system.file("extdata/fastq", package = "FLAMES"),

stats_out = file.path(outdir, "bc_stat"),

reads_out = file.path(outdir, "demultiplexed.fq.gz"),

barcodes_file = bc_allow

)

plot_demultiplex(stats_file = file.path(outdir, "bc_stat"))

quantify_gene Gene quantiﬁcation

Description

Calculate the per gene UMI count matrix by parsing the genome alignment ﬁle.

26 quantify_gene

Usage

quantify_gene(

annotation,

outdir,

infq,

n_process,

pipeline = "sc_single_sample",

samples = NULL

)

Arguments

annotation The ﬁle path to the annotation ﬁle in GFF3 format

outdir The path to directory to store all output ﬁles.

infq The input FASTQ ﬁle.

n_process The number of processes to use for parallelization.

pipeline The pipeline type as a character string, either sc_single_sample (single-cell,

single-sample),

samples A vector of sample names, default to the ﬁle names of input fastq ﬁles, or folder

names if fastqs is a vector of folders. bulk (bulk, single or multi-sample), or

sc_multi_sample (single-cell, multiple samples)

Details

After the genome alignment step (do_genome_align), the alignment ﬁle will be parsed to generate

the per gene UMI count matrix. For each gene in the annotation ﬁle, the number of reads whose

mapped ranges overlap with the gene’s genome coordinates will be assigned to the gene. For reads

can be assigned to multiple gene, the read will be assigned to the gene with the highest number of

overlapping nucleotides. If the read can be assigned to multiple genes with the same number of

overlapping nucleotides, the read will be not be assigned.

After the read-to-gene assignment, the per gene UMI count matrix will be generated. Speciﬁcally,

for each gene, the reads with similar mapping coordinates of transcript termination sites (TTS, i.e.

the end of the the read with a polyT or polyA) will be grouped together. UMIs of reads in the same

group will be collapsed to generate the UMI counts for each gene.

Finally, a new fastq ﬁle with deduplicated reads by keeping the longest read in each UMI.

Value

The count matrix will be saved in the output folder as transcript_count.csv.gz.

quantify_transcript 27

quantify_transcript Transcript quantiﬁcation

Description

Calculate the transcript count matrix by parsing the re-alignment ﬁle.

Usage

quantify_transcript(annotation, outdir, config, pipeline = "sc_single_sample")

Arguments

annotation The ﬁle path to the annotation ﬁle in GFF3 format

outdir The path to directory to store all output ﬁles.

config Parsed FLAMES conﬁgurations.

pipeline The pipeline type as a character string, either sc_single_sample (single-cell,

single-sample), bulk (bulk, single or multi-sample), or sc_multi_sample (single-

cell, multiple samples)

Value

The count matrix will be saved in the output folder as transcript_count.csv.gz.

Examples

temp_path <- tempfile()

bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE)

file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data"

fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]]

genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]]

annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]]

outdir <- tempfile()

dir.create(outdir)

fasta <- annotation_to_fasta(annotation, genome_fa, outdir)

config <- jsonlite::fromJSON(create_config(outdir, bambu_isoform_identification = TRUE, min_tr_coverage = 0.1, min_read_coverage = 0.1, min_sup_cnt = 1))

file.copy(annotation, file.path(outdir, "isoform_annotated.gtf"))

## Not run:

if (!any(is.na(sys_which(c("minimap2", "k8"))))) {

minimap2_realign(

config = config, outdir = outdir,

fq_in = fastq1

)

quantify_transcript(annotation, outdir, config, pipeline = "bulk")

}

## End(Not run)

28 scmixology_lib10_transcripts

scmixology_lib10 scMixology short-read gene counts - sample 2

Description

Short-read gene counts from long and short-read single cell RNA-seq proﬁling of human lung ade-

nocarcinoma cell lines using 10X version 2 chemstry. See Tian, L. et al. Comprehensive character-

ization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome

Biology 22, 310 (2021).

Usage

scmixology_lib10

Format

## ‘scmixology_lib10‘ A SingleCellExperiment with 7,240 rows and 60 columns:

Source

<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE154869>

scmixology_lib10_transcripts

scMixology long-read transcript counts - sample 2

Description

long-read transcript counts from long and short-read single cell RNA-seq proﬁling of human lung

adenocarcinoma cell lines using 10X version 2 chemstry. See Tian, L. et al. Comprehensive char-

acterization of single-cell full-length isoforms in human and mouse with long-read sequencing.

Genome Biology 22, 310 (2021).

Usage

scmixology_lib10_transcripts

Format

## ‘scmixology_lib10_transcripts‘ A SingleCellExperiment with 7,240 rows and 60 columns:

Source

<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE154869>

scmixology_lib90 29

scmixology_lib90 scMixology short-read gene counts - sample 1

Description

Short-read single cell RNA-seq proﬁling of human lung adenocarcinoma cell lines using 10X ver-

sion 2 chemstry. Single cells from ﬁve human lung adenocarcinoma cell lines (H2228, H1975,

A549, H838 and HCC827) were mixed in equal proportions and processed using the Chromium

10X platform, then sequenced using Illumina HiSeq 2500. See Tian L, Dong X, Freytag S, Lê

Cao KA et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control

experiments. Nat Methods 2019 Jun;16(6):479-487. PMID: 31133762

Usage

scmixology_lib90

Format

## ‘scmixology_lib90‘ A SingleCellExperiment

Source

<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126906>

sc_DTU_analysis FLAMES Differential Transcript Usage Analysis

Description

Chi-square based differential transcription usage analysis. This variant is meant for single cell

data. Takes the SingleCellExperiment object from sc_long_pipeline as input. Alternatively,

the path to the output folder could be provided instead of the SCE object. A cluster annota-

tion ﬁle cluster_annotation.csv is required, please provide this ﬁle under the output folder

of sc_long_pipeline.

Usage

sc_DTU_analysis(sce, min_count = 15)

Arguments

sce The SingleCellExperiment object from sc_long_pipeline, an additional

cluster_annotation.csv ﬁle is required under the output folder of the SCE

object.

min_count The minimum UMI count threshold for ﬁltering isoforms.

30 sc_DTU_analysis

Details

This function will search for genes that have at least two isoforms, each with more than min_count

UMI counts. For each gene, the per cell transcript counts were merged by group to generate pseudo

bulk samples. Grouping is speciﬁed by the cluster_annotation.csv ﬁle. The top 2 highly ex-

pressed transcripts for each group were selected and a UMI count matrix where the rows are se-

lected transcripts and columns are groups was used as input to a chi-square test of independence

(chisq.test). Adjusted P-values were calculated by Benjamini–Hochberg correction.

Value

a data.frame containing the following columns:

gene_id - differentially transcribed genes

X_value - the X value for the DTU gene

df - degrees of freedom of the approximate chi-squared distribution of the test statistic

DTU_tr - the transcript_id with the highest squared residuals

DTU_group - the cell group with the highest squared residuals

p_value - the p-value for the test

adj_p - the adjusted p-value (by Benjamini–Hochberg correction)

The table is sorted by decreasing P-values. It will also be saved as sc_DTU_analysis.csv under

the output folder.

Examples

outdir <- tempfile()

dir.create(outdir)

bc_allow <- file.path(outdir, "bc_allow.tsv")

genome_fa <- file.path(outdir, "rps24.fa")

R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE)

R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE)

if (!any(is.na(sys_which(c("minimap2", "k8"))))) {

sce <- FLAMES::sc_long_pipeline(

genome_fa = genome_fa,

fastq = system.file("extdata/fastq", package = "FLAMES"),

annotation = system.file("extdata/rps24.gtf.gz", package = "FLAMES"),

outdir = outdir,

barcodes_file = bc_allow

)

group_anno <- data.frame(barcode_seq = colnames(sce), groups = SingleCellExperiment::counts(sce)["ENSMUST00000169826.2", ] > 1)

write.csv(group_anno, file.path(outdir, "cluster_annotation.csv"), row.names = FALSE)

sc_DTU_analysis(sce, min_count = 1)

}

sc_heatmap_expression 31

sc_heatmap_expression FLAMES heetmap plots

Description

Plot expression heatmap of top n isoforms of a gene

Usage

sc_heatmap_expression(

gene,

multiAssay,

impute = FALSE,

n_isoforms = 4,

transcript_ids,

n_pcs = 40,

isoform_legend_width = 7,

col_low = "#313695",

col_mid = "#FFFFBF",

col_high = "#A50026",

color_quantile = 0.95

)

Arguments

gene The gene symbol of interest.

multiAssay The MultiAssayExperiment object from combine_sce().

impute Whether to impute expression levels for cells without transcript counts

n_isoforms The number of expressed isoforms to keep.

transcript_ids specify the transcript ids instead of selecting the top n_isoforms

n_pcs The number of principal components to generate.

isoform_legend_width

The width of isoform legends in heatmaps, in cm.

col_low Color for cells with low expression levels in UMAPs.

col_mid Color for cells with intermediate expression levels in UMAPs.

col_high Color for cells with high expression levels in UMAPs.

color_quantile The lower and upper expression quantile to be displayed bewteen col_low and

col_high, e.g. with color_quantile = 0.95, cells with expressions higher

than 95% of other cells will all be shown in col_high, and cells with expression

lower than 95% of other cells will all be shown in col_low.

Details

This function takes the combined MultiAssayExperiment object from combine_sce and plots an

expression heatmap with the isoform alignment visualisations.

32 sc_long_multisample_pipeline

Value

a ggplot object of the heatmap

Examples

combined_sce <- combine_sce(

short_read_large = scmixology_lib90,

short_read_small = scmixology_lib10,

long_read_sce = scmixology_lib10_transcripts,

remove_duplicates = FALSE)

sc_heatmap_expression(gene = "ENSG00000108107", multiAssay = combined_sce)

sc_long_multisample_pipeline

Pipeline for Multi-sample Single Cell Data

Description

Semi-supervised isoform detection and annotation for long read data. This variant is for multi-

sample single cell data. By default, this pipeline demultiplexes input fastq data (match_cell_barcode

= TRUE). Speciﬁc parameters relating to analysis can be changed either through function arguments,

or through a conﬁguration JSON ﬁle.

Usage

sc_long_multisample_pipeline(

annotation,

fastqs,

outdir,

genome_fa,

sample_names = NULL,

minimap2 = NULL,

k8 = NULL,

barcodes_file = NULL,

expect_cell_numbers = NULL,

config_file = NULL

)

Arguments

annotation The ﬁle path to the annotation ﬁle in GFF3 format

fastqs The input fastq ﬁles for multiple samples. It can be provided in different way: 1)

a single path to the folder containing fastq ﬁles, each fastq ﬁle will be treated as

a sample; or 2) a vector of paths to each fastq ﬁle, each fastq ﬁle will be treated

as a sample; or 3) a vector of paths to folders containing fastq ﬁles, each folder

will be treated as a sample.

sc_long_multisample_pipeline 33

outdir The path to directory to store all output ﬁles.

genome_fa The ﬁle path to genome fasta ﬁle.

sample_names A vector of sample names, Default to the ﬁle names of input fastq ﬁles, or folder

names if fastqs is a vector of folders.

minimap2 Path to minimap2, if it is not in PATH. Only required if either or both of do_genome_align

and do_read_realign are TRUE.

k8 Path to the k8 Javascript shell binary. Only required if do_genome_align is

TRUE.

barcodes_file The ﬁle path to the reference csv used for demultiplexing in ﬂexiplex. If not

speciﬁed, the demultiplexing will be performed using BLAZE. Default is NULL.

expect_cell_numbers

A vector of roughly expected numbers of cells in each sample E.g., the targeted

number of cells. Required if using BLAZE for demultiplexing, speciﬁcally,

when the do_barcode_demultiplex are TRUE in the the JSON conﬁguration

ﬁle and barcodes_file is not speciﬁed. Default is NULL.

config_file File path to the JSON conﬁguration ﬁle. If speciﬁed, config_file overrides all

conﬁguration parameters

Details

By default FLAMES use minimap2 for read alignment. After the genome alignment step (do_genome_align),

FLAMES summarizes the alignment for each read in every sample by grouping reads with similar

splice junctions to get a raw isoform annotation (do_isoform_id). The raw isoform annotation is

compared against the reference annotation to correct potential splice site and transcript start/end

errors. Transcripts that have similar splice junctions and transcript start/end to the reference tran-

script are merged with the reference. This process will also collapse isoforms that are likely to be

truncated transcripts. If isoform_id_bambu is set to TRUE, bambu::bambu will be used to generate

the updated annotations (Not implemented for multi-sample yet). Next is the read realignment step

(do_read_realign), where the sequence of each transcript from the update annotation is extracted,

and the reads are realigned to this updated transcript_assembly.fa by minimap2. The tran-

scripts with only a few full-length aligned reads are discarded (Not implemented for multi-sample

yet). The reads are assigned to transcripts based on both alignment score, fractions of reads aligned

and transcript coverage. Reads that cannot be uniquely assigned to transcripts or have low transcript

coverage are discarded. The UMI transcript count matrix is generated by collapsing the reads with

the same UMI in a similar way to what is done for short-read scRNA-seq data, but allowing for an

edit distance of up to 2 by default. Most of the parameters, such as the minimal distance to splice

site and minimal percentage of transcript coverage can be modiﬁed by the JSON conﬁguration ﬁle

(config_file).

The default parameters can be changed either through the function arguments are through the con-

ﬁguration JSON ﬁle config_file. the pipeline_parameters section speciﬁes which steps are to

be executed in the pipeline - by default, all steps are executed. The isoform_parameters section

affects isoform detection - key parameters include:

Min_sup_cnt which causes transcripts with less reads aligned than it’s value to be discarded

MAX_TS_DIST which merges transcripts with the same intron chain and TSS/TES distace less than

MAX_TS_DIST

34 sc_long_pipeline

strand_specific which speciﬁes if reads are in the same strand as the mRNA (1), or the reverse

complemented (-1) or not strand speciﬁc (0), which results in strand information being based

on reference annotation.

Value

a list of SingleCellExperiment objects if "do_transcript_quantiﬁcation" set to true. Otherwise

nothing will be returned.

See Also

bulk_long_pipeline() for bulk long data, SingleCellExperiment() for how data is outputted

Examples

reads <- ShortRead::readFastq(system.file("extdata/fastq/musc_rps24.fastq.gz", package = "FLAMES"))

outdir <- tempfile()

dir.create(outdir)

dir.create(file.path(outdir, "fastq"))

bc_allow <- file.path(outdir, "bc_allow.tsv")

genome_fa <- file.path(outdir, "rps24.fa")

R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE)

R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE)

ShortRead::writeFastq(sample(reads, size = 500, replace = TRUE), file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE)

ShortRead::writeFastq(sample(reads, size = 500, replace = TRUE), file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE)

ShortRead::writeFastq(sample(reads, size = 500, replace = TRUE), file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE)

if (!any(is.na(sys_which(c("minimap2", "k8"))))) {

sce_list <- FLAMES::sc_long_multisample_pipeline(

annotation = system.file("extdata/rps24.gtf.gz", package = "FLAMES"),

fastqs = file.path(outdir, "fastq", list.files(file.path(outdir, "fastq"))),

outdir = outdir,

genome_fa = genome_fa,

barcodes_file = rep(bc_allow, 3)

)

}

sc_long_pipeline Pipeline for Single Cell Data

Description

Semi-supervised isoform detection and annotation for long read data. This variant is for single cell

data. By default, this pipeline demultiplexes input fastq data (match_cell_barcode = TRUE). Spe-

ciﬁc parameters relating to analysis can be changed either through function arguments, or through

a conﬁguration JSON ﬁle.

sc_long_pipeline 35

Usage

sc_long_pipeline(

annotation,

fastq,

genome_bam = NULL,

outdir,

genome_fa,

minimap2 = NULL,

k8 = NULL,

barcodes_file = NULL,

expect_cell_number = NULL,

config_file = NULL

)

Arguments

annotation The ﬁle path to the annotation ﬁle in GFF3 format

fastq The ﬁle path to input fastq ﬁle

genome_bam Optional ﬁle path to a bam ﬁle to use instead of fastq ﬁle (skips initial alignment

step)

outdir The path to directory to store all output ﬁles.

genome_fa The ﬁle path to genome fasta ﬁle.

minimap2 Path to minimap2, if it is not in PATH. Only required if either or both of do_genome_align

and do_read_realign are TRUE.

k8 Path to the k8 Javascript shell binary. Only required if do_genome_align is

TRUE.

barcodes_file The ﬁle path to the reference csv used for demultiplexing in ﬂexiplex. If not

speciﬁed, the demultiplexing will be performed using BLAZE. Default is NULL.

expect_cell_number

Expected number of cells for identifying the barcode list in BLAZE. This could

be just a rough estimate. E.g., the targeted number of cells. Required if the

do_barcode_demultiplex are TRUE in the the JSON conﬁguration ﬁle and

barcodes_file is not speciﬁed. Default is NULL.