outdir The path to directory to store all output files.
genome_fa The file path to genome fasta file.
sample_names A vector of sample names. Defaults to the file names of the input fastq files, or
the folder names if fastqs is a vector of folders.
minimap2 Path to minimap2, if it is not in PATH. Only required if either or both of do_genome_align
and do_read_realign are TRUE.
k8 Path to the k8 JavaScript shell binary. Only required if do_genome_align is
TRUE.
barcodes_file The file path to the reference csv used for demultiplexing in flexiplex. If not
specified, the demultiplexing will be performed using BLAZE. Default is NULL.
expect_cell_numbers
A vector of the roughly expected number of cells in each sample, e.g., the targeted
number of cells. Required if using BLAZE for demultiplexing, specifically
when do_barcode_demultiplex is TRUE in the JSON configuration
file and barcodes_file is not specified. Default is NULL.
config_file File path to the JSON configuration file. If specified, config_file overrides all
configuration parameters.
Details
By default, FLAMES uses minimap2 for read alignment. After the genome alignment step (do_genome_align),
FLAMES summarizes the alignment for each read in every sample by grouping reads with similar
splice junctions to get a raw isoform annotation (do_isoform_id). The raw isoform annotation is
compared against the reference annotation to correct potential splice site and transcript start/end
errors. Transcripts that have similar splice junctions and transcript start/end to a reference
transcript are merged with the reference. This process also collapses isoforms that are likely to be
truncated transcripts. If isoform_id_bambu is set to TRUE, bambu::bambu is used to generate
the updated annotations (not implemented for multi-sample yet). Next is the read realignment step
(do_read_realign), where the sequence of each transcript from the updated annotation is extracted,
and the reads are realigned to this updated transcript_assembly.fa by minimap2. Transcripts
with only a few full-length aligned reads are discarded (not implemented for multi-sample
yet). Reads are assigned to transcripts based on alignment score, fractions of reads aligned,
and transcript coverage. Reads that cannot be uniquely assigned to transcripts or have low transcript
coverage are discarded. The UMI transcript count matrix is generated by collapsing the reads with
the same UMI in a similar way to what is done for short-read scRNA-seq data, but allowing for an
edit distance of up to 2 by default. Most of the parameters, such as the minimal distance to splice
site and minimal percentage of transcript coverage, can be modified in the JSON configuration file
(config_file).
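The UMI collapsing idea described above can be illustrated with a small, language-neutral sketch. The Python code below is not FLAMES's implementation; it assumes a simple greedy scheme in which the most frequent UMI in a group becomes a representative and any UMI within an edit distance of 2 is merged into it. The function names edit_distance and collapse_umis, and the example reads, are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def collapse_umis(umis, max_dist=2):
    """Greedily cluster UMIs; return {representative UMI: total read count}.

    Assumption: this greedy rule is for illustration only and is not
    taken from the FLAMES source code.
    """
    clusters = {}
    # Visit the most frequent UMIs first so they become representatives;
    # ties are broken alphabetically to keep the result deterministic.
    for umi, n in sorted(Counter(umis).items(), key=lambda kv: (-kv[1], kv[0])):
        for rep in clusters:
            if edit_distance(umi, rep) <= max_dist:
                clusters[rep] += n
                break
        else:
            clusters[umi] = n
    return clusters


# Hypothetical UMIs from reads of one (cell barcode, transcript) pair.
reads = ["AACCGGTT", "AACCGGTT", "AACCGGTA", "AACCGGAA", "TTTTGGGG"]
collapsed = collapse_umis(reads)
umi_count = len(collapsed)  # the value that would enter the count matrix
```

In this sketch, the three UMIs that differ from AACCGGTT by at most two edits are treated as sequencing errors of a single molecule, so the five reads contribute two UMI counts. Raising or lowering max_dist trades tolerance of long-read sequencing errors against the risk of merging distinct molecules.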
The default parameters can be changed either through the function arguments or through the
configuration JSON file config_file. The pipeline_parameters section specifies which steps are to
be executed in the pipeline; by default, all steps are executed. The isoform_parameters section
affects isoform detection. Key parameters include:
Min_sup_cnt which causes transcripts with fewer reads aligned than its value to be discarded
MAX_TS_DIST which merges transcripts with the same intron chain and TSS/TES distance less than
MAX_TS_DIST