BACKGROUND

Friday, 03 December 2004

As genomes evolve under selective pressure, not only do protein coding sequences evolve, but the gene regulatory sites affecting expression profiles and alternative splice site usage also evolve. To understand the evolutionary history of gene and genome functionality and the selective pressures that have affected the evolution of genomes, it is desirable to reconstruct the ancestral state of gene expression and splice site usage. At a first level, we have developed a Minimum Evolution approach based on the use of gene expression profiles. The approach is implemented to work with large scale datasets like microarrays, e.g. the Affymetrix genechip technology, and is correlated with an analysis of the actual regulatory sequence reconstruction done by similar methods, where such information is available.

Our objective with this research is to highlight the changes made by selective pressure, using Minimum Evolution methods to reconstruct continuous data at the ancestral nodes of a phylogenetic tree.

To reconstruct the ancestral states of the continuous data traits I have developed a brute force algorithm that constructs an interval of allowed values on each internal node in a phylogenetic tree (Schreiber format), and chooses the best value to represent each node. The algorithm runs through the tree twice, so it takes about 2n steps for n nodes, which is linear time, O(n). In the first run the intervals of allowed values are constructed, and in the second run the intervals are narrowed and the representing value is chosen. This is done for every gene represented in our large scale data set.
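As a rough illustration of the two runs, the sketch below uses a Wagner (linear) parsimony style rule on a binary tree. This is an assumption made for the example, not necessarily the rule used in the thesis program: the first run gives each internal node the overlap of its children's intervals (or the gap between them if they do not overlap), and the second run picks, for each node, the value in its interval that lies closest to its parent's value. The `Node` class, the function names and the choice of the interval midpoint at the root are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical tree node; leaves carry an observed expression value."""
    name: str
    children: List["Node"] = field(default_factory=list)
    value: Optional[float] = None
    interval: Optional[Tuple[float, float]] = None


def build_intervals(node: Node) -> Tuple[float, float]:
    """First run (leaves to root): construct the interval of allowed values."""
    if not node.children:
        node.interval = (node.value, node.value)
        return node.interval
    left, right = node.children  # assumption: strictly binary tree
    l_lo, l_hi = build_intervals(left)
    r_lo, r_hi = build_intervals(right)
    if max(l_lo, r_lo) <= min(l_hi, r_hi):
        # Child intervals overlap: keep the shared part.
        node.interval = (max(l_lo, r_lo), min(l_hi, r_hi))
    else:
        # No overlap: any value in the gap between them costs the same.
        node.interval = (min(l_hi, r_hi), max(l_lo, r_lo))
    return node.interval


def choose_values(node: Node, parent_value: Optional[float] = None) -> None:
    """Second run (root to leaves): pick one representing value per node."""
    lo, hi = node.interval
    if parent_value is None:
        # Assumption: the root takes the midpoint of its interval.
        node.value = (lo + hi) / 2.0
    else:
        # Value inside [lo, hi] closest to the parent's value.
        node.value = min(max(parent_value, lo), hi)
    for child in node.children:
        choose_values(child, node.value)


def reconstruct(root: Node) -> None:
    """Run both passes; each visits every node once, so time is linear."""
    build_intervals(root)
    choose_values(root)


# Usage with made-up expression values for three species A, B and C.
ab = Node("AB", [Node("A", value=1.0), Node("B", value=3.0)])
root = Node("root", [ab, Node("C", value=8.0)])
reconstruct(root)
print(root.interval, root.value, ab.interval, ab.value)
```

In the real data set this would be repeated once per gene (or once per gene and experimental condition), with the leaf values taken from the expression profiles.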

By accumulating substitutions, a lineage can diverge into two closely related organisms. By comparing sequences from a set of homologous genes from different species it is possible to infer the sequence of their closest common ancestor. Algorithms that do such calculations on sequences have already been developed. In this thesis ClustalW and BaseML are used. ClustalW is used to align the sequences found in the leaf nodes of the tree, and BaseML is used to calculate the sequences at the ancestral nodes of the phylogenetic tree based on the alignment done by ClustalW. The sequences used are the upstream regions of the genes, collected from EnsEmbl. Our work has also produced an algorithm and methods for constructing ancestral gene expression profiles, and a framework that can be used to display them and compare them to the sequence calculation described above.

Simultaneous reconstruction of regulatory sequences and expression profiles improves the signal to noise ratio by using long branched significant classes to correlate substitutions with functional effects. The improvement comes from comparing the two calculations mentioned above and isolating candidate genes where a clear change in gene expression is correlated with a high number of point mutations in the upstream region.
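The candidate selection step could be sketched roughly as follows. It assumes that every node of the tree already has a reconstructed expression value and an aligned upstream sequence of equal length, and it flags a branch when both the expression shift and the number of substitutions exceed a cutoff. The function names, the data layout and both thresholds are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Tuple


def count_substitutions(a: str, b: str) -> int:
    """Count mismatching columns between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped, since a
    gap is not a point mutation.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y and "-" not in (x, y))


def candidate_branches(
    branches: List[Tuple[str, str]],   # (parent, child) node names
    expression: Dict[str, float],      # reconstructed value per node
    sequences: Dict[str, str],         # aligned upstream sequence per node
    min_shift: float,                  # hypothetical expression cutoff
    min_subs: int,                     # hypothetical substitution cutoff
) -> List[Tuple[str, str, float, int]]:
    """Return branches with both a large expression shift and many substitutions."""
    hits = []
    for parent, child in branches:
        shift = abs(expression[child] - expression[parent])
        subs = count_substitutions(sequences[parent], sequences[child])
        if shift >= min_shift and subs >= min_subs:
            hits.append((parent, child, shift, subs))
    return hits
```

A gene would then count as a candidate if at least one of its branches is returned. Sensible cutoffs depend on branch lengths and on the noise level of the expression data, so the two parameters are left to the caller.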

The program was developed as part of my master's thesis at the University of Bergen.

Best Regards

Roald Rossnes