- PHYlogenetic REconstruction by use of gene eXpression and alternate mRNA splicing patterns
As genomes evolve under selective pressure, not only do protein coding
sequences evolve, but the gene regulatory sites affecting expression
profiles and alternative splice site usage also evolve. To understand
the evolutionary history of gene and genome functionality and the
selective pressures that have affected the evolution of genomes, it is
desirable to reconstruct the ancestral state of gene expression and
splice site usage. At a first level, we have developed a Minimum
Evolution approach based on the use of gene expression profiles. The
approach is implemented to work with large-scale datasets such as
microarray data, e.g. from the Affymetrix GeneChip technology, and its
results are correlated with a reconstruction of the actual regulatory
sequences done by similar methods, where such information is available.
Our objective with this research is to highlight the changes made by
selective pressure, using Minimum Evolution methods to reconstruct
continuous data at the ancestral nodes of a phylogenetic tree.
To reconstruct the ancestral states of the continuous data traits I
have developed a brute force algorithm that constructs an interval of
allowed values on each internal node in a phylogenetic tree (Schreiber
format), and chooses the best value to represent each node. The
algorithm runs through the tree twice, visiting each of the n nodes
two times, so its running time is linear in the size of the tree,
O(n). In the first run the intervals of allowed values are
constructed and in the second run the intervals are narrowed and the
representing value is chosen. This is done for every gene represented
in our large scale data set.
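
As an illustration only, the two runs described above can be sketched
in Python as follows. The sketch assumes a strictly binary tree and
borrows the interval rule of Wagner (linear) parsimony: overlapping
child intervals are intersected, non-overlapping ones are replaced by
the gap between them. The thesis does not state that this is the exact
rule used, and the names Node, first_pass, second_pass and reconstruct
are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical tree node; leaves carry an observed expression value."""
    name: str
    children: List["Node"] = field(default_factory=list)
    value: Optional[float] = None
    interval: Tuple[float, float] = (0.0, 0.0)


def first_pass(node: Node) -> None:
    """Run 1 (leaves to root): build the interval of allowed values."""
    if not node.children:
        node.interval = (node.value, node.value)
        return
    if len(node.children) != 2:
        raise ValueError("this sketch assumes a strictly binary tree")
    left, right = node.children
    first_pass(left)
    first_pass(right)
    (llo, lhi), (rlo, rhi) = left.interval, right.interval
    lo, hi = max(llo, rlo), min(lhi, rhi)
    if lo <= hi:
        node.interval = (lo, hi)   # children overlap: keep the intersection
    else:
        node.interval = (hi, lo)   # disjoint: keep the gap between them


def second_pass(node: Node, parent_value: Optional[float] = None) -> None:
    """Run 2 (root to leaves): narrow each interval to a single value."""
    lo, hi = node.interval
    if node.children:
        if parent_value is None:
            # Assumption: the root takes the midpoint of its interval.
            node.value = (lo + hi) / 2.0
        else:
            # Pick the allowed value closest to the parent's chosen value.
            node.value = min(max(parent_value, lo), hi)
    for child in node.children:
        second_pass(child, node.value)


def reconstruct(root: Node) -> None:
    """Apply both runs to one gene's expression values."""
    first_pass(root)
    second_pass(root)
```

Calling reconstruct once per gene, on a tree whose leaves hold that
gene's measured expression values, fills in a value for every internal
node. Each run touches every node once.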
By accumulating substitutions, one organism can diverge into two closely
related organisms. By comparing the sequences of a set of homologous
genes from different species, it is possible to infer the sequence of
their closest common ancestor. Algorithms that perform such calculations
on sequences have already been developed. In this thesis ClustalW and
BaseML are used. ClustalW is used to align sequences found in the leaf
nodes of the tree and BaseML is used to calculate sequences at the
ancestral nodes of the phylogenetic tree based on the alignment done by
ClustalW. The sequences used are the upstream regions of the genes
collected from Ensembl. Our work has also produced an algorithm and
methods for constructing ancestral gene expression profiles, and a
framework that can be used to display and compare them to the sequence
calculations done as described above.
Simultaneous reconstruction of regulatory sequences and expression
profiles improves the signal-to-noise ratio by using long branched
significant classes to correlate substitutions with functional effects.
By comparing the two calculations mentioned above and by isolating
candidate genes where a clear change in gene expression is correlated
with a high number of point mutations in the upstream region, the
signal-to-noise ratio is improved.
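
The candidate selection described above could look roughly like the
following Python sketch. It is not the thesis implementation: the
function names, the tuple layout of a branch record and both thresholds
are hypothetical, and the sequences are assumed to be already aligned,
with "-" marking gaps.

```python
from typing import Iterable, List, Tuple

# Hypothetical record for one tree branch of one gene:
# (gene name, ancestral expression, descendant expression,
#  aligned ancestral upstream sequence, aligned descendant sequence)
Branch = Tuple[str, float, float, str, str]


def count_point_mutations(ancestral: str, descendant: str) -> int:
    """Count mismatching columns of two aligned sequences, skipping gaps."""
    if len(ancestral) != len(descendant):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, d in zip(ancestral, descendant)
        if a != d and a != "-" and d != "-"
    )


def candidate_genes(
    branches: Iterable[Branch],
    min_expr_change: float,
    min_mutations: int,
) -> List[str]:
    """Keep genes with a clear expression change AND many point mutations."""
    selected = []
    for gene, anc_expr, desc_expr, anc_seq, desc_seq in branches:
        expr_change = abs(desc_expr - anc_expr)
        mutations = count_point_mutations(anc_seq, desc_seq)
        if expr_change >= min_expr_change and mutations >= min_mutations:
            selected.append(gene)
    return selected
```

Requiring both conditions at once is what filters out genes whose
expression shifted without upstream changes, and genes whose upstream
region changed without any visible effect on expression.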
The program was developed as part of my master's thesis at the
University of Bergen.
Best Regards
Roald Rossnes