Chapter 10

Coding Exon-Structure Aware Realigner (CESAR): Utilizing Genome Alignments for Comparative Gene Annotation

Virag Sharma and Michael Hiller

Abstract

Alignment-based gene identiﬁcation methods utilize sequence conservation between orthologous protein-

coding genes to annotate genes in newly sequenced genomes. CESAR is an approach that makes use of

existing genome alignments to transfer genes from one genome to other aligned genomes, and thus

generates comparative gene annotations. To accurately detect conserved exons that exhibit an intact

reading frame and consensus splice sites, CESAR produces a new alignment between orthologous exons,

taking information about the exon’s reading frame and splice site positions into account. Furthermore,

CESAR is able to detect most evolutionary splice site shifts, which helps to annotate exon boundaries at

high precision. Here, we describe how to apply CESAR to generate comparative gene annotations for one

or many species, and discuss the strengths and limitations of this approach. CESAR is available at
https://github.com/hillerlab/CESAR2.0.

Key words Comparative gene annotation, Genome alignment, CESAR, Splice site shift

1 Introduction

Identifying coding genes in genomic sequences is an important step

in annotating a genome. Several different approaches exist for this

task [1]. Transcriptome-based methods align entire or parts of

sequenced mRNAs to the genome to infer exons and introns. Ab

initio gene prediction methods detect genes solely based on char-

acteristic sequence patterns. Homology-based approaches utilize

the fact that homologous genes often have conserved sequences

and use information about genes in a related species to search for

similar sequences in the given genome.

One type of homology-based approach makes use of alignments
between entire genomes to project (or map) an existing gene

annotation of a “reference” species to an aligned “query” species

that lacks a gene annotation [2]. These projection approaches

assume that exons of the reference species that align well to the

query species are likely homologous exons. Thus, the coordinates
of aligned exon boundaries in the query genome reveal the location
of likely homologous exons (Fig. 1a).

Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962,
https://doi.org/10.1007/978-1-4939-9173-0_10, © Springer Science+Business Media, LLC, part of Springer Nature 2019

Utilizing genome alignments for projecting gene annotations

has several advantages. First, genome alignments align not only
exons but also the surrounding genomic context, which is helpful

to distinguish orthologs from paralogs or processed pseudogenes as

the latter are often located in a different syntenic context. Second,

many protein-coding exons are conserved over large phylogenetic

distances. If sensitive alignment parameters are used, genome align-

ments capture the majority of human coding exons in other mam-

mals and even other vertebrates [3]. Third, by making use of

existing multiple genome alignments, gene annotations can be

projected to numerous query species, as we recently demonstrated

by projecting human genes to 143 other vertebrates [3].

Despite their utility for projecting gene annotations, genome

alignments have two serious limitations. First, genome alignment

programs are not aware of the reading frame and splice sites of the

reference exon. Consequently, alignments between conserved

exons may incorrectly exhibit frameshifts or non-consensus splice

sites due to alignment ambiguities. Since one aims at projecting

only truly conserved exons that exhibit an intact reading frame and

consensus splice sites in the query species, such alignment
ambiguities cause conserved exons to be missed in the resulting
gene annotation. Second, the position of splice sites of truly conserved
exons can shift during evolution [4]. Since genome alignments
do not aim at generating an exon alignment with consensus
splice sites, the position of the projected exon boundaries in the
query genome may be incorrect for such exons.

[Figure 1: (a) genome alignment and (b) CESAR alignment between human and mouse for the first coding exon of RHPN1. In both panels the coding exon (translation) start coordinate is chr15:75704394. The exon end coordinate derived from the genome alignment is chr15:75704453 (incorrect); the exon end coordinate derived from the CESAR alignment is chr15:75704474 (correct).]

Fig. 1 Coordinates of aligned exon boundaries do not always correspond to real exon coordinates in another
genome. (a) Part of the genome alignment between human and mouse that covers the first coding exon of
RHPN1 (blue font). The human consensus donor dinucleotide (“gt”, bold) aligns to a non-consensus donor site
(red font) in the mouse, indicating that the exon end coordinates may not correspond to the respective mouse
exon end. (b) CESAR re-aligns this sequence and detects a consensus donor site that is shifted 21 nt
downstream. These exon end coordinates precisely correspond to the exon end in mouse. Please note that the
CESAR alignment shows a 21 nt insertion in mouse. These additional 21 exonic bases are translatable in the
same reading frame

CESAR is a method to resolve these two limitations [4, 5]. For

a given exon, CESAR uses the query sequence provided by the

genome alignment and then re-aligns this putative exonic

sequence, incorporating both information about the reading

frame and the splice sites of the reference exon. For the given

exonic sequence, CESAR aims at ﬁnding an alignment that

(1) has consensus splice sites and (2) preserves the reading frame

and thus lacks inactivating mutations such as frameshifts and

in-frame stop codons. As a result, CESAR correctly infers exon

conservation for more than 5300 exons that had a broken reading

frame or non-consensus splice sites in the genome alignment

between human and mouse, and it is able to correctly detect

>90% of evolutionary splice site shifts [4] (Fig. 1b). This leads to

an accurate comparative gene annotation, exempliﬁed by our obser-

vation that 99.1% of the human exons that CESAR projects to the

mouse genome overlap annotated mouse exons, and for 96.8% of

the projected exons both boundaries are correct. An example illus-

trating a gene annotation produced by CESAR is shown in Fig. 2.

We recently re-implemented CESAR in C (CESAR 2.0), which

drastically reduces runtime and memory consumption [5]. In contrast
to the original implementation, which could only re-align a
single coding exon at a time (referred to as “single-exon mode” in the
following), CESAR 2.0 also provides a “multi-exon mode” that
re-aligns entire multi-exon genes at once against a locus in
the query genome. In multi-exon mode, CESAR 2.0 can detect

exons that do not align in the genome alignment (Fig. 3a), and it

can recognize intron deletion events that result in a larger compos-

ite exon in the query species. Furthermore, CESAR 2.0 improves

the ability to detect distal evolutionary splice site shifts, which

further enhances the precise identiﬁcation of exon boundaries

[5]. In the following, we describe how to use this new implementation
(simply referred to as CESAR) to obtain
comparative gene annotations.

2 Materials

2.1 Availability

CESAR’s source code, pre-compiled binaries, and other tools

required to annotate exons in a query genome are available from

the github repository https://github.com/hillerlab/CESAR2.0.

Open a terminal in your Linux-like environment and do the

following:


git clone https://github.com/hillerlab/CESAR2.0/

cd CESAR2.0/

2.2 Installation

Compiling CESAR’s source code (written in C) requires the gcc

compiler:

make

Alternatively, a CESAR binary pre-compiled for 64-bit Linux
is provided in the precompiledBinary_x86_64 subdirectory.

To automate the task of annotating exons in a query genome, add
the “tools” subdirectory to the $PATH variable in your environment
and set the $profilePath variable. Adding the directory to $PATH
allows calling the tools located there without specifying the
full path.

If you are using a bash shell, do

export PATH=$PATH:`pwd`/tools

export profilePath=`pwd`

If you are using C-shell, do

setenv PATH ${PATH}:`pwd`/tools

setenv profilePath `pwd`

2.3 Computational Requirements

CESAR has been tested on different distributions of Linux (CentOS,
Ubuntu, SUSE). The simplest way to get CESAR running is
to use the precompiled binary. For users working with a Windows
machine, a virtual machine (VMware or VirtualBox) running Linux
should be able to support CESAR, though this has not been tested.

[Figure 2: UCSC genome browser view of the mouse genome (mm10 assembly), coordinates chr12:21211760-21428239, scale bar 100 kb. Track “Human Coding Exons Mapped by CESAR”: ASAP2, ITGB1BP1, CPSF3, IAH1, ADAM17, YWHAQ. Track “Basic Gene Annotation Set from GENCODE”: Asap2, Itgb1bp1, Iah1, Cpsf3, Adam17, Ywhaq.]

Fig. 2 UCSC genome browser screenshot showing CESAR’s gene annotation in the mouse genome. CESAR
was applied to project human exons to mouse, resulting in a gene annotation that matches the mouse
Gencode annotation. Please note that CESAR only considers coding exons. Thus, UTR exons are not projected,
as indicated by red arrows for Itgb1bp1

[Figure 3: (a) UCSC genome browser views of human SDHC (hg38, chr1:161,312,916-161,364,786, scale bar 20 kb; track “Genome Alignment between Human and Mouse”) and of the mouse locus (mm10, chr1:171,126,897-171,151,888, scale bar 10 kb; tracks “Human Coding Exons Mapped by CESAR (single-exon mode)” and “Human Coding Exons Mapped by CESAR (multi-exon mode)”). (b) UCSC genome browser view of human SLC27A6 (hg38, chr5:128,965,517-129,033,642, scale bar 20 kb; track “Genome Alignment between Human and Black flying-fox”). Each panel includes a sequence alignment of the exon in question.]

Fig. 3 Strength and weakness of CESAR’s multi-exon mode. (a) Multi-exon mode recovers an exon that does
not align in the genome alignment. Top: UCSC genome browser screenshot showing the human SDHC gene
and the genome alignment to mouse. The fourth exon does not align (red box). Bottom: UCSC genome browser
showing the orthologous mouse Sdhc gene and CESAR’s gene annotation obtained in single- and multi-exon
mode. In contrast to the single-exon mode, multi-exon mode detects exon 4 (red box) with its precise splice
sites, as shown by the sequence alignment underneath. (b) Multi-exon mode detects a false exon that is truly
absent. The ninth coding exon (red box) of human SLC27A6 does not align to the black flying-fox genome and
this coding exon is truly deleted [9]. Other exons exhibit numerous frameshifting and stop codon mutations,
showing that this gene is inactivated in the black flying-fox [9]. CESAR’s multi-exon mode nevertheless
annotates the ninth coding exon in the black flying-fox; however, the sequence alignment reveals several large
insertions and deletions and a low sequence identity. Thus, a post-processing step can filter out such poorly
aligning exons that are unlikely to be real

The memory requirement grows with the lengths of both the
reference and the query sequence. As shown in Table 1, a desktop

machine with 32 GB of RAM is sufﬁcient to run CESAR in

single-exon mode on all human genes using the human-mouse

genome alignment. The memory requirements for CESAR’s

multi-exon mode are more demanding as intronic sequences can

be large. Still, 32 GB of RAM is sufﬁcient to re-align 99.6% of the

human genes in their entirety to the respective mouse genomic

locus. Importantly, before allocating memory, CESAR

pre-computes an upper bound of the required memory and exits

with a warning if more memory is needed than speciﬁed by the user

with the “-maxMemory” parameter (set to 16 GB by default).

2.4 Input

CESAR’s gene-annotation workﬂow requires the following data as

input:

1. The genomes of the reference and all query species.

2. Transcripts annotated in the reference genome.

3. A genome alignment between the reference and one or more

query genomes.

How to obtain each of these inputs is described in Subheadings 3.1–3.3.

Table 1
Memory requirements for CESAR for short, typical, and very long exons or genes in both single-exon
and multi-exon mode

Reference length (bp)   Query length (bp)   Memory (GB)   Mode
100                     152                 0.001         Single-exon
1,005                   1,170               0.01          Single-exon
5,001                   4,664               0.18          Single-exon
10,227                  10,038              0.77          Single-exon
984                     5,484               0.04          Multi-exon
5,004                   137,114             5.72          Multi-exon
9,510                   135,903             10.03         Multi-exon
17,673                  19,225              2.55          Multi-exon

Multi-exon mode refers to aligning all exons of the reference to the entire query locus that contains the entire
orthologous gene
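The values in Table 1 can be used for a rough plausibility check before jobs are submitted. The following sketch divides each memory value by the product of reference and query length. That the product is the relevant quantity is only a reading of the table values, not a documented property of CESAR, and the helper names table1 and gb_per_million_cells are hypothetical.

```shell
# Rows of Table 1: reference length (bp), query length (bp), memory (GB), mode
table1() {
cat <<'EOF'
100 152 0.001 single-exon
1005 1170 0.01 single-exon
5001 4664 0.18 single-exon
10227 10038 0.77 single-exon
984 5484 0.04 multi-exon
5004 137114 5.72 multi-exon
9510 135903 10.03 multi-exon
17673 19225 2.55 multi-exon
EOF
}

# For each row, print mode, both lengths, and the memory in GB per million
# (reference length x query length) positions.
gb_per_million_cells() {
  awk '{ cells = $1 * $2 / 1e6; printf "%s\t%s\t%s\t%.4f\n", $4, $1, $2, $3 / cells }'
}

table1 | gb_per_million_cells
```

For all rows except the shortest exon, whose memory value is heavily rounded, the ratio lies between roughly 0.007 and 0.009 GB per million positions in both modes. Under this assumption, multiplying the two lengths of a gene of interest gives a first impression of whether it is likely to fit within the chosen memory limit.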


3 Annotating Genes from a Genome Alignment

3.1 Preparing the Genome Assembly Input Data

Obtain the genome sequence of the reference and of all query
species. To this end, one can download each genome as a single file in
fasta format from NCBI (https://www.ncbi.nlm.nih.gov/assembly),
from Ensembl (https://www.ensembl.org/downloads.html),
or from the UCSC genome browser (http://hgdownload.soe.ucsc.edu/downloads.html).
Each fasta file must be converted into the 2bit
file format by using faToTwoBit from the UCSC source code
[6]. For example, if the fasta file for the mouse genome is called
“mm10.fa,” the following command converts it to a 2bit file:

faToTwoBit mm10.fa mm10.2bit

Afterward, create a “2bitDir” directory. In this directory, each
species must have a subdirectory named after its assembly
(e.g., hg38 for human, mm10 for mouse, oryAfe1 for aardvark).
An example is provided with CESAR’s source code:

find extra/miniExample/2bitDir

which lists the following files:

extra/miniExample/2bitDir

extra/miniExample/2bitDir/hg38

extra/miniExample/2bitDir/hg38/chrom.sizes

extra/miniExample/2bitDir/hg38/hg38.2bit

extra/miniExample/2bitDir/oryAfe1

extra/miniExample/2bitDir/oryAfe1/oryAfe1.2bit

extra/miniExample/2bitDir/oryAfe1/chrom.sizes

In addition, create a ﬁle called “chrom.sizes” that contains the

size of all scaffolds for each genome by using twoBitInfo from the

UCSC source code:

for file in `find 2bitDir -name "*.2bit"`; do
  d=`dirname $file`
  f=`basename $file`
  twoBitInfo $d/$f $d/chrom.sizes
done
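Since every later step relies on this directory layout, it can be worth verifying it once. The following is a minimal sketch, assuming only the layout described above (one subdirectory per assembly that holds the 2bit file and chrom.sizes); the function name check_2bit_dir is hypothetical and not part of CESAR.

```shell
# Check that every assembly subdirectory of a 2bitDir-style directory contains
# <assembly>/<assembly>.2bit and <assembly>/chrom.sizes.
# Prints one line per problem and returns non-zero if anything is missing.
check_2bit_dir() (
  status=0
  for d in "$1"/*/; do
    [ -d "$d" ] || continue
    assembly=$(basename "$d")
    for f in "$assembly.2bit" chrom.sizes; do
      if [ ! -s "$d$f" ]; then
        echo "missing or empty: $d$f"
        status=1
      fi
    done
  done
  exit "$status"
)

# Usage with the example shipped with CESAR (run inside the CESAR2.0 directory):
# check_2bit_dir extra/miniExample/2bitDir && echo "layout ok"
```

The function body runs in a subshell, so its helper variables do not leak into the calling shell.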

3.2 Preparing the Reference Gene Annotation Input Data

The second step in the CESAR gene-annotation workflow is
to obtain the set of reference-species transcripts whose orthologs you
wish to annotate in the query genome(s). For
example, if the reference species is human, the human Ensembl
gene annotation can be used [7]. Ensembl transcripts can be downloaded
from the Ensembl ftp site (https://www.ensembl.org/info/data/ftp/index.html)
by clicking on the “GTF” link under “Gene
sets” for Human. At the time of writing, Ensembl v93 genes are
available for the human GRCh38 assembly. Clicking on
“Homo_sapiens.GRCh38.93.gtf.gz” saves the human gene set file to
disk. Alternatively, the UCSC genome browser provides gene
annotations, which can be downloaded in gtf format from the
Table browser (http://genome.ucsc.edu/cgi-bin/hgTables) [6].

After download, transcripts in gtf format need to be converted

to genePred format using gtfToGenePred from the UCSC source

code:

# go to the directory that contains the downloaded transcripts, e.g.
cd ~/Downloads

# unzip the file, in case it is compressed
gzip -d Homo_sapiens.GRCh38.93.gtf.gz
# this produces a file called Homo_sapiens.GRCh38.93.gtf.

# Convert to genePred format
gtfToGenePred Homo_sapiens.GRCh38.93.gtf Homo_sapiens.GRCh38.93.gp

Ensure that the generated genePred ﬁle has the right format

(see Note 1).

Next, we filter the transcripts to retain only protein-coding
transcripts. This filtering step also discards the following
problematic transcripts: (1) transcripts with a CDS length that
is not a multiple of 3 (e.g., genes that utilize programmed ribosomal
frameshifts or exhibit a polymorphism in the reference), and
(2) transcripts with micro-introns smaller than 30 bp, as such
introns often occur in incorrectly annotated transcripts.

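# At this stage, it is useful to specify the input file as a variable
# (here in Bash notation)
export inputGenes=Homo_sapiens.GRCh38.93.gp

# the filtered transcripts are written to ${inputGenes}.forCESAR,
# which is the file used in Subheading 3.4
formatGenePred.pl ${inputGenes} ${inputGenes}.forCESAR ${inputGenes}.ignore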

Instead of considering all available coding transcripts of a gene,
one can also run the gene-annotation workflow with only the longest
transcript per gene. In this case, add the “-longest” flag:

formatGenePred.pl ${inputGenes} ${inputGenes}.forCESAR ${inputGenes}.ignore -longest
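The two discard criteria named above (CDS length not a multiple of 3, introns shorter than 30 bp) can be illustrated on a genePred file with a few lines of awk. This is a sketch and not the implementation of formatGenePred.pl: the function name classify_transcripts is hypothetical, the standard 10-column genePred layout is assumed, all introns (not only those within the CDS) are checked, and treating records without a CDS as “no_cds” is an additional assumption.

```shell
# Classify genePred records (tab-separated; columns: name, chrom, strand,
# txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds).
# Output: transcript name, CDS length, and "keep" or the reason for discarding.
classify_transcripts() {
  awk -F '\t' '{
    split($9, starts, ","); split($10, ends, ",")
    cds = 0; min_intron = -1
    for (i = 1; i <= $8; i++) {
      s = starts[i] + 0; e = ends[i] + 0
      if (i > 1) {
        intron = s - ends[i - 1]
        if (min_intron < 0 || intron < min_intron) min_intron = intron
      }
      # clip the exon to the coding region before adding its length
      if (s < $6 + 0) s = $6 + 0
      if (e > $7 + 0) e = $7 + 0
      if (e > s) cds += e - s
    }
    verdict = "keep"
    if (cds == 0) verdict = "no_cds"
    else if (cds % 3 != 0) verdict = "cds_not_multiple_of_3"
    else if (min_intron >= 0 && min_intron < 30) verdict = "micro_intron"
    print $1 "\t" cds "\t" verdict
  }' "$1"
}

# Small made-up example with three two-exon transcripts
demo=$(mktemp)
printf 'T1\tchr1\t+\t100\t390\t100\t390\t2\t100,300,\t190,390,\n' >  "$demo"
printf 'T2\tchr1\t+\t100\t391\t100\t391\t2\t100,300,\t190,391,\n' >> "$demo"
printf 'T3\tchr1\t+\t100\t300\t100\t300\t2\t100,210,\t190,300,\n' >> "$demo"
classify_transcripts "$demo"
rm -f "$demo"
```

In the made-up example, T1 has a 180 bp CDS and a 110 bp intron, T2 has a 181 bp CDS, and T3 has a 20 bp intron. Running the function on the file listed in ${inputGenes}.ignore is one way to see why individual transcripts may have been discarded.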

3.3 Preparing the Genome Alignment

CESAR requires as input a genome alignment between the selected
reference and one or more query genomes in maf format
(https://genome.ucsc.edu/FAQ/FAQformat.html#format5). CESAR can
handle either a pairwise or a multiple genome alignment stored in
this format. Genome alignments can be downloaded from the
UCSC genome browser (http://hgdownload.soe.ucsc.edu/downloads.html).
Alternatively, genome alignments can be created
with the chaining and netting pipeline [8]. The entire process of
creating a pairwise genome alignment in maf format can be automated
by the UCSC script doBlastzChainNet.pl, as described in
http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto.

CESAR’s workflow requires that the genome alignment is
indexed by the provided mafIndex tool, which uses the chrom.sizes
file of the reference genome:

mafIndex ali.maf ali.bb -chromSizes=extra/miniExample/2bitDir/hg38/chrom.sizes

3.4 Preparing and Executing the CESAR Gene Annotation Jobs

After preparing the three types of input (genome sequences, tran-

script information and the genome alignment), the different vari-

ables that are used as inputs to the CESAR gene-annotation

workﬂow need to be deﬁned.

export reference=...     # the assembly name of the reference (e.g. hg38)
export twoBitDir=...     # the directory containing the genomes and chrom.sizes
                         # files (e.g. extra/miniExample/2bitDir)
export alignment=...     # the indexed alignment file (ali.bb above)
export querySpecies=...  # a comma-separated list of the query species that you
                         # want to annotate. Each query species must be contained
                         # in ${alignment}.
export outputDir=...     # name of the output directory that will contain exon
                         # coordinates (in subdirectories). The directory will be
                         # created, if it does not exist.
export resultsDir=...    # name of the directory that will contain the final gene
                         # annotation (one gene annotation file per query species)
export maxMemory=...     # maximum amount of memory in GB that CESAR is allowed
                         # to allocate
export profilePath=...   # path to the directory that contains the 'extra'
                         # subdirectory containing CESAR's profiles and matrices
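Because the job-generation commands silently produce incomplete command lines when one of these variables is empty, a quick check beforehand can save a failed run. The following is a minimal sketch; the function name require_vars is hypothetical and not part of CESAR.

```shell
# Return non-zero and name every variable from the argument list that is
# unset or empty. Uses eval for the indirect lookup so that it also works in
# plain sh; only pass plain variable names.
require_vars() (
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "variable not set: $name" >&2
      missing=1
    fi
  done
  exit "$missing"
)

# Usage before generating the job list:
# require_vars inputGenes reference twoBitDir alignment querySpecies \
#   outputDir resultsDir maxMemory profilePath || echo "define the variables listed above first"
```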


Next, we generate the gene-annotation workﬂow commands

for all ﬁltered transcripts:

for transcript in `cut -f1 ${inputGenes}.forCESAR`; do
  echo "annotateGenesViaCESAR.pl ${transcript} ${alignment} ${inputGenes}.forCESAR ${reference} ${querySpecies} ${outputDir} ${twoBitDir} ${profilePath} -maxMemory ${maxMemory}"
done > jobList

The result is a ﬁle called “jobList” in which each line consists of

a single job that re-aligns a single transcript to all query species.

Each job is completely independent of any other job. Hence, the
jobs can be run in parallel on a compute cluster. In the absence of a
compute cluster, the jobs can be run sequentially:

chmod +x jobList

./jobList
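On a single machine with several cores, the independence of the jobs can also be exploited without a cluster. The following sketch runs each line of the job list as its own command with a limited number of parallel processes. It assumes an xargs that supports the -0 and -P options (both GNU and BSD xargs do); the function name run_job_list is hypothetical.

```shell
# Run every line of a job list as its own shell command, using up to
# $2 parallel processes (default 2). Returns non-zero if any job failed.
run_job_list() {
  tr '\n' '\0' < "$1" | xargs -0 -n 1 -P "${2:-2}" sh -c
}

# Usage: run the CESAR jobs with four parallel processes
# run_job_list jobList 4
```

Each job may allocate up to ${maxMemory} GB, so the number of parallel processes multiplied by this limit should not exceed the available RAM.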

Using CESAR to project 196,259 human coding exons to
mouse takes approximately 7 h on a desktop machine using a single
core. The memory requirement depends on the size of the input
gene (see Note 2 and Table 1).

3.5 Merging CESAR’s Output Into a Single Gene Annotation File per Species

In this step, we collect the results obtained in the previous step

(after each job has successfully ﬁnished) in a single genePred ﬁle for

each query species.

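for species in `echo $querySpecies | sed 's/,/ /g'`; do
  # the awk field references are escaped so that they are written to the
  # job file literally instead of being expanded by the shell
  echo "bed2GenePred.pl $species $outputDir /dev/stdout | awk '{if (\$4 != \$5) print \$0}' > $resultsDir/$species.gp"
done > jobListGenePred

chmod +x jobListGenePred
./jobListGenePred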

This step takes only a few minutes. The ﬁnal results are in

$resultsDir (speciﬁed as a variable above) as a single genePred-

formatted ﬁle per query species. GenePred ﬁles can be converted

to gtf format using genePredToGtf from the UCSC source code:

genePredToGtf file mm10.gp mm10.gtf

The $outputDir directory that is used to store temporary

results may be deleted afterward.

3.6 Visualizing the Gene Annotations

This step is optional. An obtained genePred ﬁle can be visualized in

the UCSC genome browser of the query genome, as shown in

Fig. 2. This can be done by converting the genePred ﬁle into gtf

format, as described above, and then uploading this ﬁle to the

UCSC genome browser via their “Custom Track” feature.


3.7 Running CESAR in Multi-exon Mode

To run CESAR in multi-exon mode, follow the steps described above
(Subheadings 3.1–3.3 and 3.5) unchanged and replace only the
job-generation step of Subheading 3.4. After specifying the variables,
the following command generates the jobs that run CESAR in multi-exon mode:

for transcript in `cut -f1 ${inputGenes}.forCESAR`; do
  echo "annotateGenesViaCESAR_multi_exon.pl ${transcript} ${alignment} ${inputGenes}.forCESAR ${reference} ${querySpecies} ${outputDir} ${twoBitDir} ${profilePath} -maxMemory ${maxMemory}"
done > jobList

The jobs listed in “jobList” can be executed on a compute

cluster or run sequentially.

4 Notes

1. In case of problems with the transcript ﬁle, one can use UCSC’s

genePredCheck tool to check if the converted genePred has a

valid format.

2. A limitation of CESAR is that its memory requirement is

proportional to the lengths of the input sequences. By default,

CESAR stops with a warning if it estimates that more than

16 GB of memory may be required:

CRITICAL src/Cesar.c:117 main(): The memory consumption is

limited to 16.0000 GB by default. Your attempt requires

30.1539 GB. You can change the limit via --max-memory.

If your computer provides more memory, set the maxMemory
variable defined in Subheading 3.4 to a higher value. For example,
32 GB of RAM is sufficient to align all human exons in
single-exon mode.

In multi-exon mode, CESAR may require even more mem-

ory in case the transcript has many exons or introns in the query

genome are large. For such genes, CESAR can be run in single-

exon mode.

5 Special Cases

1. In case exons are truly deleted or overlap an assembly gap in the
query, CESAR’s multi-exon mode has a tendency to align
random intronic sequence to such reference exons, instead of
producing an alignment where these exons are entirely deleted.
Such exon alignments are characterized by large insertions and
deletions and a low sequence identity (Fig. 3b). A subsequent
filtering step can be used to remove those exons from the
resulting gene annotation that are poorly aligning and thus
unlikely to be real exons.

2. CESAR’s multi-exon mode requires that all aligning exons of a

gene are located on a single locus in the query genome (same

scaffold and same strand in a co-linear order). It is therefore

recommended to use the single-exon mode for query assem-

blies with a high degree of fragmentation, where many genes

will partially align to different scaffolds. Alternatively, CESAR

can be run in both single- and multi-exon mode, and the

resulting annotations can be combined.

3. CESAR’s source code provides splice site proﬁles obtained for

human. These proﬁles are used in the re-alignment process to

locate orthologous or shifted splice sites. Splice site proﬁles will

be similar for closely related species such as mammals; however,

they may differ if species distantly related to human are used as

the reference. In this case, it is recommended to obtain new

splice site proﬁles for the reference species, which can be done

as follows:

# obtain a file that contains the longest transcript per gene for the reference
formatGenePred.pl ${inputGenes} ${inputGenes}.CESAR ${inputGenes}.ignore -longest

# define the following variables
export input=${inputGenes}.CESAR
export ref_2bit=... # the path to the two bit file of reference species

# extract the sequences around the splice sites from all transcripts
extract_splice_sites.pl $input acc_seqs.txt donor_seqs.txt $ref_2bit

# extract the sequence upstream of the first exon from the genes
get_start_context.pl $input start_seqs.txt $ref_2bit

# Lastly, convert these sequences to profiles:
create_profiles.pl acc_seqs.txt acc_profile.txt
create_profiles.pl donor_seqs.txt do_profile.txt
create_profiles.pl start_seqs.txt firstCodon_profile.txt

# clean-up
rm acc_seqs.txt donor_seqs.txt start_seqs.txt

# Move these files to the relevant clade so that CESAR can read these profiles
export clade=... # name of the new clade, for example chicken
mkdir -p CESAR2.0/extra/tables/$clade
mv acc_profile.txt CESAR2.0/extra/tables/$clade
mv do_profile.txt CESAR2.0/extra/tables/$clade
mv firstCodon_profile.txt CESAR2.0/extra/tables/$clade

# copy the original stop codon profile and the codon substitution matrix
cp CESAR2.0/extra/tables/human/lastCodon_profile.txt CESAR2.0/extra/tables/$clade
cp CESAR2.0/extra/tables/human/eth_codon_sub.txt CESAR2.0/extra/tables/$clade
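The filtering step mentioned in the first special case needs some measure of how well an exon aligns. The following sketch computes percent identity and the number of gap columns for a pair of aligned sequences. It assumes that the aligned reference and query exon sequences are available as two gap-padded strings of equal length; how these are obtained from CESAR’s output is not covered here, and the function name aln_stats is hypothetical.

```shell
# Print percent identity (over columns where neither sequence has a gap)
# and the number of gap columns for two aligned sequences of equal length.
aln_stats() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    a = toupper(a); b = toupper(b)
    if (length(a) != length(b)) {
      print "sequences differ in length" > "/dev/stderr"
      exit 1
    }
    n = length(a); matches = 0; aligned = 0; gaps = 0
    for (i = 1; i <= n; i++) {
      x = substr(a, i, 1); y = substr(b, i, 1)
      if (x == "-" || y == "-") { gaps++; continue }
      aligned++
      if (x == y) matches++
    }
    printf "%.1f\t%d\n", (aligned ? 100 * matches / aligned : 0), gaps
  }'
}

# Made-up example: one mismatch and a 3 bp insertion in the query
aln_stats "ATGGCC---AAA" "ATGACCTTTAAA"
```

Exons whose identity falls below a user-chosen cut-off, or whose number of gap columns exceeds one, could then be removed from the annotation. Suitable cut-offs depend on the evolutionary distance between reference and query and are deliberately not suggested here.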

Acknowledgment

This work was supported by the Max Planck Society and the

German Research Foundation (HI 1423/3-1).

References

1. Picardi E, Pesole G (2010) Computational methods for ab initio and comparative gene finding. Methods Mol Biol 609:269–284. https://doi.org/10.1007/978-1-60327-241-4_16

2. Zhu J, Sanborn JZ, Diekhans M, Lowe CB, Pringle TH, Haussler D (2007) Comparative genomics search for losses of long-established genes on the human lineage. PLoS Comput Biol 3(12):e247. https://doi.org/10.1371/journal.pcbi.0030247

3. Sharma V, Hiller M (2017) Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Res 45(14):8369–8377. https://doi.org/10.1093/nar/gkx554

4. Sharma V, Elghafari A, Hiller M (2016) Coding exon-structure aware realigner (CESAR) utilizes genome alignments for accurate comparative gene annotation. Nucleic Acids Res 44(11):e103. https://doi.org/10.1093/nar/gkw210

5. Sharma V, Schwede P, Hiller M (2017) CESAR 2.0 substantially improves speed and accuracy of comparative gene annotation. Bioinformatics 33(24):3985–3987. https://doi.org/10.1093/bioinformatics/btx527

6. Casper J, Zweig AS, Villarreal C, Tyner C, Speir ML, Rosenbloom KR, Raney BJ, Lee CM, Lee BT, Karolchik D, Hinrichs AS, Haeussler M, Guruvadoo L, Navarro Gonzalez J, Gibson D, Fiddes IT, Eisenhart C, Diekhans M, Clawson H, Barber GP, Armstrong J, Haussler D, Kuhn RM, Kent WJ (2018) The UCSC Genome Browser database: 2018 update. Nucleic Acids Res 46(D1):D762–D769. https://doi.org/10.1093/nar/gkx1020

7. Zerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, Billis K, Cummins C, Gall A, Giron CG, Gil L, Gordon L, Haggerty L, Haskell E, Hourlier T, Izuogu OG, Janacek SH, Juettemann T, To JK, Laird MR, Lavidas I, Liu Z, Loveland JE, Maurel T, McLaren W, Moore B, Mudge J, Murphy DN, Newman V, Nuhn M, Ogeh D, Ong CK, Parker A, Patricio M, Riat HS, Schuilenburg H, Sheppard D, Sparrow H, Taylor K, Thormann A, Vullo A, Walts B, Zadissa A, Frankish A, Hunt SE, Kostadima M, Langridge N, Martin FJ, Muffato M, Perry E, Ruffier M, Staines DM, Trevanion SJ, Aken BL, Cunningham F, Yates A, Flicek P (2018) Ensembl 2018. Nucleic Acids Res 46(D1):D754–D761. https://doi.org/10.1093/nar/gkx1098

8. Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D (2003) Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A 100(20):11484–11489. https://doi.org/10.1073/pnas.1932072100

9. Sharma V, Hecker N, Roscito JG, Foerster L, Langer BE, Hiller M (2018) A genomics approach reveals insights into the importance of gene losses for mammalian adaptations. Nat Commun 9(1):1215. https://doi.org/10.1038/s41467-018-03667-1
