--sjdbGTFfile specifies the path to the file with annotated transcripts in the standard GTF
format. STAR will extract splice junctions from this file and use them to greatly improve
accuracy of the mapping. While this is optional, and STAR can be run without annotations,
using annotations is highly recommended whenever they are available. Starting from 2.4.1a,
the annotations can also be included on the fly at the mapping step.
--sjdbOverhang specifies the length of the genomic sequence around the annotated junction
to be used in constructing the splice junctions database. Ideally, this length should be equal
to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina
2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the
ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as
well as the ideal value.
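The rule above can be sketched in a few lines of Python. This is only an illustration of the max(ReadLength)-1 arithmetic; the function name sjdb_overhang is a hypothetical helper and is not part of STAR.

```python
def sjdb_overhang(read_lengths):
    """Return the ideal --sjdbOverhang value, max(ReadLength) - 1.

    read_lengths: iterable of read lengths in bases. For paired-end data,
    pass the length of each mate, not the combined fragment length.
    """
    lengths = list(read_lengths)
    if not lengths:
        raise ValueError("at least one read length is required")
    return max(lengths) - 1


print(sjdb_overhang([100, 100]))    # 2x100b paired-end reads: 99
print(sjdb_overhang([36, 50, 76]))  # reads of varying length: 75
```

The result would then be passed as the value of --sjdbOverhang when generating the genome.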
Genome files comprise binary genome sequence, suffix arrays, text chromosome names/lengths,
splice junction coordinates, and transcript/gene information. Most of these files use an internal
STAR format and are not intended to be used by the end user. Changing any of these files is
strongly discouraged, with one exception: you can rename the chromosomes in chrName.txt,
provided that the order of the chromosomes in the file is kept. The names from this file will be
used in all output files (e.g. SAM/BAM).
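As a minimal sketch of such a renaming, the Python snippet below replaces names while preserving their order. It assumes that chrName.txt holds one chromosome name per line; the helper rename_chromosomes and the example mapping are hypothetical and are not provided by STAR.

```python
def rename_chromosomes(lines, mapping):
    """Return chrName.txt lines with names replaced and order unchanged.

    lines:   the lines of chrName.txt (assumed: one chromosome name per line).
    mapping: dict from old name to new name; unmapped names are kept as-is.
    """
    renamed = []
    for line in lines:
        name = line.rstrip("\n")
        renamed.append(mapping.get(name, name) + "\n")
    return renamed


# Hypothetical example: switch from 1, 2, ... naming to chr1, chr2, ... naming.
original = ["1\n", "2\n", "MT\n"]
mapping = {"1": "chr1", "2": "chr2", "MT": "chrM"}
print("".join(rename_chromosomes(original, mapping)), end="")
```

To apply it to a real genome directory, one would read chrName.txt, pass its lines through the function, and write the result back, ideally after making a backup copy of the original file.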
2.2 Advanced options.
2.2.1 Which chromosomes/scaffolds/patches to include?
It is strongly recommended to include major chromosomes (e.g., for human chr1-22, chrX, chrY, chrM)
as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a
few megabases to the genome length. However, a substantial number of reads may map to ribosomal
RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds
are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes.
Generally, patches and alternative haplotypes should not be included in the genome.
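If a FASTA file does contain patches or alternative haplotypes, they can be filtered out before genome generation. The Python sketch below shows one possible way to do this. It assumes that the unwanted sequences can be recognized by their names; the suffixes _alt and _fix used here are an assumption about the naming scheme and must be adapted to the actual file. All function names and the sequence names in the example are made up for illustration.

```python
def drop_sequences(fasta_lines, is_excluded):
    """Return FASTA lines without the records whose name is excluded.

    The record name is taken as the header text after '>' up to the
    first whitespace.
    """
    kept = []
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = not is_excluded(name)
        if keep:
            kept.append(line)
    return kept


def looks_like_patch_or_alt(name):
    # Assumption: patches and alternative haplotypes carry these suffixes.
    return name.endswith(("_alt", "_fix"))


# Made-up sequence names; the scaffold is kept, the alternative haplotype is dropped.
fasta = [
    ">chr1\n", "ACGT\n",
    ">chr1_EXAMPLEv1_random\n", "GGCC\n",
    ">chr1_EXAMPLEv2_alt\n", "TTAA\n",
]
print("".join(drop_sequences(fasta, looks_like_patch_or_alt)), end="")
```

Passing the exclusion rule as a function keeps the filter independent of any particular naming convention.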
Examples of acceptable genome sequence files:
• ENSEMBL: files marked with .dna.primary_assembly, such as:
ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
• GENCODE: files marked with PRI (primary). Strongly recommended for mouse and human:
http://www.gencodegenes.org/.
2.2.2 Which annotations to use?
The use of the most comprehensive annotations for a given species is strongly recommended. Very
importantly, chromosome names in the annotations GTF file have to match chromosome names in the
FASTA genome sequence files. For example, one can use ENSEMBL FASTA files with ENSEMBL
GTF files, and UCSC FASTA files with UCSC GTF files. However, since UCSC uses the chr1, chr2,
... naming convention, and ENSEMBL uses 1, 2, ... naming, the ENSEMBL and UCSC FASTA
and GTF files cannot be mixed together, unless chromosomes are renamed to match between the
FASTA and GTF files.
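A quick consistency check before generating the genome can catch such mismatches. The Python sketch below collects the chromosome names from both files and reports the GTF names that have no counterpart in the FASTA file. It assumes that the FASTA sequence name is the header text up to the first whitespace and that the chromosome name is the first tab-separated field of each GTF line. The function names are hypothetical helpers, not part of STAR.

```python
def fasta_names(lines):
    """Collect sequence names from FASTA header lines (text after '>')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_names(lines):
    """Collect chromosome names from the first column of a GTF file."""
    names = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t")[0])
    return names


def unmatched_gtf_names(fasta_lines, gtf_lines):
    """Return the sorted GTF chromosome names that are absent from the FASTA."""
    return sorted(gtf_names(gtf_lines) - fasta_names(fasta_lines))


# Tiny made-up example mixing the two naming conventions.
fasta = [">1 dna:chromosome\n", "ACGT\n", ">2\n", "ACGT\n"]
gtf = [
    "#!comment line\n",
    'chr1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n',
    '2\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2";\n',
]
print(unmatched_gtf_names(fasta, gtf))  # ['chr1']
```

An empty list means every chromosome named in the annotations is present in the genome sequence; any reported name points to a naming mismatch that should be resolved by renaming.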