- PHYlogenetic REconstruction by use of gene eXpression and alternate mRNA splicing patterns
Roald Rossnes
The writing of this thesis has been guided by my advisers Ingvar Eidhammer and David Liberles, with whom I have had many interesting and helpful discussions. I would also like to thank Bjarte Dysvik, Gisle Sælensminde and Saywan Pasha who have helped me with problems I have encountered, and have allowed me to use and modify some of their components in my thesis. A special thanks to Tim Hughes for valueable comments. Finally, I direct my thanks to my family and my fiance, Ann - Kristin, for showing encouragement, support and a touch of indifference.
The thesis is a part of my Cand. Scient degree at the University of Bergen. In this thesis I have studied the relations between changes in gene expression and mutations in the upstream regions of a gene.
The thesis study these changes on a genome wide basis by applying Minimum Evolution algorithms and known sequence reconstruction algorithms on large scale data sets like the AffyMetrix GeneChip.
To meassure the impact selective preassure have on the results the analysis is placed in a evolutionary perspective by reconstructing gene expression profiles and sequence data on known phylogenies.
As seen from the data sources the analysis can be divided into reconstructing gene expression profiles, reconstructing sequence data and correlating and analyse the results. Gene expression profiles is reconstructed using a Minimum Evolution framework whereas sequence data is reconstructed by the use of ClustalW and BaseML.
After reconstruction is done candidates for further analysis is picked out by applying different filters on the results. This results in two groups. A controll group and a candidate group. These groups are further analysed by external resources like e.g. TESS.
algorithmList of Algorithms
Evolution is a slow and still ongoing process. The theory of evolution describes the relationships between different species and how they have changed and developed from simpler organisms further back in time. The first theory that covered this was presented by Darwin in the 18th century in his famous ``On the Origin of the Species by Means of Natural Selection'' [6]. The theory describes what is called natural selection. Individuals in a population that are better adapted to the environment they are living in will have a higher survival rate and reproduction rate and thus will pass their characteristics on to the next generation. Another important theory was presented by Kimura [19], and describes how evolution can be driven by a set of random changes in the genomes. This theory is called neutral evolution. More recently the adaptive evolution theory has been developed. This theory is a combination of Darwinism and neutral selection, and suggest that evolution consists of both natural selection and random changes.
``Genome'' is a relative new term in molecular biology. The term desribes all the genetic material in the chromosomes of a particular organism, the size of a genome is often given as the total number of basepairs. Traditional molecular biology research was often limited to ``one gene one researcher'' but with the introduction of rapid sequencing methods and microarrays whole genome sequencing and analysis became possible. Figure 1.1 shows a phylogenetic tree with sequence properties on the leaf nodes. Phylogenetic trees can be built based on many different types of data, e.g. gene expression, sequences, alternative splice site usage and so on. Data on leaf nodes corresponds to different organisms
Phylogenetic methods are used to reconstruct trees, like the one shown above. By using such methods one can reconstruct relationships between the species of today, and infer data about ancestrall species. This can be done in several ways, but the most common way is to use DNA and mathematical or statistical models for infering the ancestral states. In the figure above, the sequences are reconstructed giving a theoretical view of what the sequences in the ancestral nodes may have looked like.
This thesis uses all the above mentioned methods. The tasks of the thesis is to reconstruct or estimate the values of the ancestral nodes in a given phylogenetic tree, based on values from the leaf nodes, both for sequence data and gene expression profiles, and correlating the level of changes of the two data types, branch by branch.
As genomes evolve under selective pressure, not only do protein coding sequences evolve, but the gene regulatory sites affecting expression profiles and alternative splice site usage also evolve. To understand the evolutionary history of gene and genome functionality and the selective pressures that have affected the evolution of genomes, it is desirable to reconstruct the ancestral state of gene expression and splice site usage. At a first level, I have developed a Minimum Evolution approach based on the use of gene expression profiles. The approach is implemented to work with large scale datasets like microarrays, e.g. the Affymetrix genechip technology, and is correlated with an analysis of the actual regulatory sequence reconstruction done by similar methods, where such information is available.
My objectives with this research is to highlight the changes made by selective pressure by use of Minimum evolution methods to reconstruct continous data at the ancestral states in a phylogenetic tree.
To reconstruct the ancestral states of the continuous data traits I have developed a brute force algorithm that constructs an interval of allowed values on each internal node in a phylogenetic tree (Schreiber format), and chooses the best value to represent each node. The algorithm runs trough the tree two times, hence an order O(2n) time complexity. In the first run the intervals of allowed values are constructed and in the second run the intervals are narrowed and the representing value is chosen. This is done for every gene represented in the large scale data set.
An organism can by accumulating substitutions divide into two closely related organisms. By comparing sequences from a set of homologous genes from different species it is possible to find the sequence of their closest common ancestor. Algorithms that do such calculation on sequences have allready been developed. In this thesis ClustalW and BaseML are used. ClustalW is used to align sequences found in the leaf nodes of the tree and BaseML is used to calculate sequences at the ancestral nodes of the phylogenetic tree based on the alignment done by ClustalW. The sequences used are the upstream regions of the genes collected from EnsEmbl. Our work has also produced an algorithm and methods for constructing ancestral gene expression profiles. and a framework that can be used to display and compare them to the sequence calculation done as described above.
Simultaneous reconstruction of regulatory sequences and expression profiles reduces the signal to noice ratio by using long branched significant classes to correlate substitutions with functional effects. By comparing the two calculations mentioned above and by isolating candidate genes where a clear change in gene expression is correlated with a high number of point mutations in the upstream region, the signal to noice ratio is reduced.
This project belongs to the Liberles research group which is a part of CBU - Computational Biology Unit at the University of Bergen. The Liberles group is looking at adaptive evolution from different perspectives. This is done to get an understanding of the evolution of new functionality in genomes and how this is correlated with the phenotypic divergence of the species. Important genomic events that are studied in this process are horizontal transfer, gene duplication, sequence divergence, gene expression divergence, mRNA splicing pattern divergence, and many other mechanisms. All these events are colected and analysed in a phylogenetic perspective. The research goals and methods are closer described at the CBU webpage [5].
This involves methods development, both at the DNA and protein sequence levels using parsimony and maximum likelihood methodologies. This project uses gene expression divergence and sequence divergence, and develops methods for infering ancestral states in a given phylogenetic tree for both data types.
Chapter 2 provides a description of some biological concepts that the thesis build upon. Microarrays, chapter 3, and Phylogeny, chapter 4, are both considered to be so important for this thesis that they each are covered in separate chapters. Chapter 5 is dedicated to an overview of the techniques and methods used, and describe some of the concepts behind them, in the same chapter you will find a description of Java and some of the specialized libraries used in the thesis. Chapters 6 and 7 are describing the system and the implementation, first on a non detailed level (6) before describing implementation details etc. (7). Chapters 8 and 9 presents the results found using the program on a dataset collected from Enard Et.al. [8], and the conclusions that were drawn from these results.
Bioinformatics is a rapidly growing and relative new branch in computer science, and can be described as a merger between biology and computer science. The main target of the field is to make easier the discovery of new biological insight. Examples of such activities are the development of new algorithms, systems for high troughput data analysis and systems for easy access to large datasets. Some of the reasons why bioinformatics is such a rapid growing field are:
The structure of DNA was first deduced in 1953 by Watson and Crick [33]. This discovery has lead to the genome sequencing projects we see today. These projects provide an enormous amount of data to be analysed. In this section, I will give a presentation of some of the topics that are studied in Molecular biology. Molecular biology is the study of the gene structure and function at the molecular level. The purpose of this chapter is to give a brief introduction to some of the terms, elements and methods used in molecular biology. To provide the reader with some insight into biology and enable him to understand the biological aspects of this project.
Living organisms are divided into two groups : Eukaryotes which have a nucleus within their cells and prokaryotes which do not. Because of the nature of this thesis almost all the information found in this chapter will be about eukaryotes.
The basic elements in molecular biology are the biological macromolecules :
``Deoxyribonucleic acid (DNA) is the primary chemical component of chromosomes and is the material of which genes are made. It is sometimes called the "molecule of heredity," because parents transmit copied portions of their own DNA to offspring during reproduction, and because they propagate their traits by doing so''.This definition comes from Wikipedia - the free encyclopedia [37].
As the above definition indicates, DNA is the main information carrier in the cell. It contains all the data of the organism. A DNA molecule consists of a double stranded helix made by the four nucleotides Adenin (A), Thymin (T), Cytocin (C) and Guanin (G). Each strand is a chain held together by phospho - diester bonds, and the two strands are held together by hydrogen bonds. The basepairs are made up of hydrogen bonds between A and T (double) or C and G (triple). The DNA molecule is directional , due to the asymmetrical structure of the sugar that makes the skeleton of the molecule. This means that the sugar is connected to the five prime end in the fifth carbon and to the three prime end in the third carbon. The directions of the two complementary DNA strands run in opposite directions to one another. Section 2.1.4 gives a brief overview of how DNA works in a cell.
A gene is a region of DNA that controls a characteristic of an organism, usually corresponding to a single mRNA carrying the information for constructing a protein. It contains one or more regulatory sequences that either increase or decrease the rate of its transcription, based for example on environmental influence. A gene is made up of several coding regions, called exons and non-coding regions called introns. Each chromosome contains a large number of genes and one cell may contain many chromosomes. However, in any given cell only a small subset of the genes is expressed.
There are many different forms of RNA. The two that I will briefly describe here are, mRNA and tRNA, which are used in two different places in what is called the central dogma in molecular biology.
Messenger (or mRNA) is a copy of the information carried by a gene on the DNA. The role of mRNA is to move the information contained in DNA to the translation machinery. It is transcribed as RNA in the nucleus. The RNA contains introns and exons which are removed by RNA splicing leaving the exons joined together. In some cases, individual nucleotides can be added in the middle of the mRNA sequence by a process called RNA editing. mRNA are ``never'' found free in the cell. Like DNA, it is bound by proteins. These complexes are termed ribonucleoproteins or RNPs. The variability in sequence and structure means that no structure has been determined for a mRNA.
tRNA is the information adapter molecule. It is the direct interface between amino-acid sequence of a protein and the information in DNA. It decodes the information in DNA. All tRNA's from all organisms have a similar structure, indeed a human tRNA can function in yeast cells. See also section 2.1.4.
Proteins are constructed of one or more unbranched chains of amino acids. A typical protein concists of 200 - 300 amino acids, but there are much smaller (10 amino acids long) and much larger (27.000 amino acids long) proteins as well. Proteins are essential to every function in the living cell. They perform everything from maintainace of the cell to specific functions like catalysis and transport. Proteins determine the shape and structure of the cell. Proteins have a a complex structure which often is divided into four main levels. The primary structure consists of the amino acid sequence of the protein chain. In different regions in the protein chain one can find local secondary structure elements such as alpha-helices or beta-sheets. Tertiary structure is the three dimensional domains that the alpha-helices and beta-sheets form when they are connected. Finally, the protein can be constructed out of several protein domains, the arrangement of theses domains is called the quaternary structure. The entire complex structure is defined by the primary sequence and the interactions between the amino acids. The folding structure of a protein is defined to be the three dimensional structure with the minimal free energy.
The study of proteins is an entire field of research containing e.g. function prediction and structure predictions. Since proteins and the interactions between them are not the topic of this thesis, they will not be described in any more detail.
The central dogma of molecular biology is based on the principle that the flow of genetic information travels from DNA to RNA by transcription and from RNA to protein by translation. DNA can replicate in the nucleus of a cell by using one strand of the double helix as a template (this is true for eukaryotes, some organisms such as bacteria have no nucleus and as a result their DNA replicates in the cytoplasm of the single-celled organism). mRNA (messenger RNA) is produced from DNA in the process called transcription which is described below. The structure of RNA is similar to that of DNA except that RNA is single stranded unit with Uracil replacing the DNA counterpart Thymine. The mRNA is then transported out of the nucleus and into the cytoplasm of eukaryotes were proteins are formed through the translation process. A series of 3 nucleotides, or codon, codes for one amino acid. The amino acids are carried to the site of translation by transfer RNA (tRNA) and assembled in the order specified by the mRNA. Long chains of these 20 different amino acids form proteins.This dogma explains the connection between the different macromolecules and the way they correlate. To explain the central dogma one needs to define three different terms; translation, splicing and transcription.
A DNA segment that constitutes a gene is read and transcribed into a single stranded sequence of RNA. This process is catalyzed by the enzyme RNA polymerase. Near most of the genes lies a special DNA pattern called promoter, located upstream of the transcription start site. The promoter informs RNA polymerase where to start the trancription. In eucaryotic organisms transcription includes the non - coding regions (introns).
Before the RNA leaves the nucleus all the introns are removed from the sequence, this process is called splicing. The pattern of the splicing can vary depending on for instance the tissue where the splicing occurs. The variation that occurs from this is called alternative splicing, and contributes to the protein diversity in the organism. After transcription the RNA moves from the nucleus into the cytoplasm where translation will occur.
The RNA sequence is translated into a sequence of amino acids as the protein is formed. During translation, the ribosome reads three bases (a codon) at a time from the RNA and translates them into one amino acid using what is called the genetic code, figure 2.3 describes this mapping. To carry out the translation process transfer - RNA is used. tRNAs are a set of small RNA molecules that are ``designed '' in such a fashion that it in one end holds an anticodon, and on the other side the appropriate amino acid. For example if one should translate the codon UUU from a mRNA strand, one would use the tRNA molecule that concists of the anticodon AAA in one end and phenyl - alanin in the other end. tRNA works together with more than 50 different proteins, that make up the complex (ribosome), to translate the mRNA to Proteins
Each cell in an eukaryotic organism contains a large amount of genes. All genes are present in every cell in the organism, meaning that each human cell contains aproximately 20,000 - 25,000 genes. Some of these genes are expressed in all cells all the time. These are so-called housekeeping genes that are responsible for the routine metabolic functions common to all cells. Others again are expressed only in a particular tissue or when a specific hormone is present or when a specific condition is fulfilled.
An important and interesting question in biology is how gene expression is switched on and off, how genes are regulated. Since all cells in a particular organism have an identical genome, it is differences in gene expression and not the genome content that are responsible for cell differentiation during the life of the organism. Levels of expression can be controlled in different ways :
Figure 2.4 shows a typical structure of a gene and the upstream region, it also points out typical promoter / regulators like the TATA - box and a range of upstream promoters. |
A coding gene consists of several elements, see Figure 2.4 :
The transcription start site is the place in the sequence where the POL II molecule binds. This is a complex molecule consisting of 12 different proteins that serve as the start of translation. Further usptream from the POL II molecule one can find the basal promoter. This is in eukaryotes a sequence of T's and A's approximately 7 characters long (TATAAAA), often called a TATA - box. The TATA - box is found in all protein coding genes and is bound by a large complex of proteins. The upstream promoters are more sequence specific and differ from gene to gene.
Many different genes and many different types of cells share the same transcription factors - not only those that bind at the basal promoter but even some of those that bind upstream. What turns on a particular gene in a particular cell is probably the unique combination of promoter sites and the transcription factors that are chosen.
Last in this section I will mention the enhancers and silencers, these are regions in the DNA that are positioned up to thousands of basepairs away from the gene they control. As the names imply they play a role in the up or down regulation of the gene. These DNA sites can exist upstream, downstream or even inside the gene they control. A more torough explanation of gene regulation and promoter regions can be found in [23].
There are some biological processes, terms and correlations that play an inportant role when building phylogenies. I will give a short description of some of them in this section.
All characters of an organism are stored in the genetic information carried by the DNA, any change in these characters is due to a change in the DNA molecules. The changes can be divided into four main categories, substitution, deletion, insertion and inversion and are all explained in the following figure.
In Figure 2.3 we saw that there are many codons that can code for the same amino acid, this means that not all changes in a protein coding sequence will have any effect on the result. These changes are called silent or neutral substitutions. Multiple substitutions add a great deal of complexity to the construction of phylogenies. As seen from Figure 2.6 a branch of an evolutionary tree can have one or several mutations in one site, the substitution from A to T in the rightmost tree on the figure is very hard to calculate, and are often refered to as a silent mutation, not to be mixed up with the definition of silent or neutral mutations stated earlier.
It is the genes of an organism and the product of these genes that give a living organism the functions and properties it contains. However, traditional methods in molecular biology generally work on a "one gene in one experiment" basis, which means that the throughput is very limited and the "whole picture" of gene expression/function is hard to obtain. With the introduction of microarray experiments this ``one gene - one experiment'' approach is challenged with higher troughput. The technology provides opportunities to monitor almost the whole genome on a single chip so that researchers get a better picture of the interactions among thousands of genes simultaneously.
Functional genomics is the study of the functionality of specific genes, their relation to diseases, their associated proteins and their participation in biological processes. Thus microarray technology is an key technology in functional genomics.
Microarray technology is based on base-pairing (i.e., A-T/U and G-C)
or hybridization between the probes
and the targets
.
An array is an ordered arrangement of samples. It provides a method
for matching known sequences (probes) against unknown DNA samples
(targets) based on the above mentioned base-pairing rules. This automates
the process of identifying the unknowns sequences. Arrays can be divided
into macroarrays or microarrays, based on the difference in the size
of the sample spots. Macroarrays contain sample spot sizes of about
300 microns or larger.
Microarrays have typically a spot size of less than 200 microns in diameter and usually contains thousands of spots. Microarrays require specialized robotics and imaging equipment. DNA microarray, or DNA chips are manufactured by robotic instruments on small glass or nylon substrate, for which probes3.1 with known identity are used to determine complementary bindings. An experiment with a single DNA chip can provide researchers with information on thousands of genes simultaneously, the two main applications of microarray experiments are :
Divided by the properties of the DNA sequence hybridized onto the array slides one can categorize microarrays into two categories :
cDNA arrays were developed at Stanford University and are described as glass slides on which cDNA 3.2 has been deposed by robotic printing.
Each probe on a cDNA array is approximately 500 5000 bases long it is immobilized to a solid surface such as glass and exposed to a set of targets coming from a mixture of two different sources (e.g. control and sample or healthy and ill) which is fluorized with different dyes. The fluorescence signal from each probe is evaluated independently and then used to calculate the expression ratio between the samples in the mixture. This is often called a two channel analysis.
Oligonucleotide arrays are made up of oligonucleotide, 20 80 - mer, oligos that are synthesized as probes either directly on the chip or by conventional synthesis followed by on-chip immobilization. The array is exposed to labeled sample DNA, hybridized, and the identity or abundance of complementary sequences are determined. This kind of chip is often refered to as a DNA chip and was developed at Affymetrix, Inc. , which has copyrighted the technology and sells it under the GeneChip trademark.
Since the program described in this thesis is used to analyse data from AffyMetrix GeneChips the rest of this chapter will be used to introduce in more detail the manufacturing and analysing methods used in oligonucleotide microarray experiments can be found.. For a more thorough presentation of cDNA arrays I refer the reader to [4].
Oligonucleotide probe arrays are manufactured through a process of photo-lithography and combinatorial chemistry. GeneChip technology produces arrays with hundreds of thousands of different probes packed at a high density. The manufacture is scalable because the length of the probes, not their number, determines the number of synthesis steps required. This process yields arrays with highly reproducible properties.
Using technologies adapted from the semiconductor industry, GeneChip manufacturing begins with a 5-inch square quartz wafer. The wafer is placed in a bath of silane, which reacts with the hydroxyl groups of the quartz, and forms a matrix of covalently linked molecules. The distance between these silane molecules determines the probes packing density, allowing arrays to hold over 500,000 probe locations, or features, within 1.28 square centimeters.
Probe synthesis occurs in parallel, resulting in the addition of an A, C, T, or G nucleotide to multiple growing chains simultaneously. To define which oligonucleotide chains will receive a nucleotide in each step, photolithographic masks, carrying 18 to 20 square micron windows that correspond to the dimensions of individual features, are placed over the coated wafer. The windows are distributed over the mask based on the desired sequence of each probe. When ultraviolet light shines over the mask in the first step of synthesis, the exposed linkers become deprotected and are available for nucleotide coupling. Critical to this step is the precise alignment of the mask with the wafer before each synthesis step. A solution containing a single type of deoxynucleotide with a removable protection group is then flushed over the wafer's surface. The nucleotide attaches to the activated linkers, initiating the synthesis process. Although the process is highly efficient, some activated molecules fail to attach the new nucleotide. To prevent these "outliers" from becoming probes with missing nucleotides, a capping step is used to truncate them. In the following synthesis step, another mask is placed over the wafer to allow the next round of deprotection and coupling. The process is repeated until the probes reach their full length, usually 25 nucleotides.
Once the synthesis is complete a series of quality control tests and sampling of the different arrays are performed to ensure the quality of the arrays.
After hybridization, slides are scanned and images for the different slides are generated. These images must then be analysed to identify the arrayed spots and to measure the relative fluorescence intensities for each element. Most commercially available microarray manufacturers provide software that handles image processing. Image processing involves identifying the spots and separating them from the noise introduced by pollution in the samples, errors in the scanning or printing. After the image is interpreted by for example grid or boundary technologies the meassurement of the spots can be performed. Expression measurements consist of a range of different tasks :
There are at least two widely used techniques that can be used to normalize gene-expression data. Both assume that all (or most) of the genes in the array, some subset of genes, or a set of exogenous controls that have been 'spiked' into the RNA before labelling, should have an average expression ratio equal to one. The normalization factor is then used to adjust the data to compensate for experimental variability. In Affymetrix GeneChips, this is done by constructing quality control probes which are placed in different locations on the array. These qc-probes are used to give a meassure of ``wrong'' bindings in the experiment. The following methods are used for spotted DNA microarrays but could easily be changed to work for AffyMetrix GeneChips as well.
The normalization done in PHYREX is described in detail in Section 8.2.1.
Total intensity normalization data relies on the assumption that the quantity of initial mRNA is the same for both samples. Furthermore, one assumes that some genes are up-regulated in the query sample relative to the control and that others are down-regulated. For the thousands of genes in the array, these changes should balance out so that the total quantity of RNA hybridizing to the array from each sample is the same. Consequently, the total integrated intensity computed for all the elements in the array should be the same in both channels. Under this assumption, a normalization factor can be calculated and used to re-scale the intensity for each gene in the array.
For mRNA derived from closely related samples, a significant fraction of the genes would be expected to be expressed at similar levels. By plotting the two samples intensities against each other, these genes would cluster along a straight line, the slope of which would be one if the labelling and detection efficiencies were the same for both samples. Normalization of these data is equivalent to calculating the slope that best fits using regression techniques and adjusting the intensities so that the calculated slope is one.
The statistics differ from each type of microarray experiment used. In this thesis AffyMetrix GeneChip is used and the statistics involved in reading probe values can be found in the Statistical Algorithms Description Document provided on the AffyMetrix website. The reader is referred to the SADD document for a more thorough description.
Affymetrix is a comercial company. This can have both advantages and disadvantages. The University of Bergen have no conections with the company giving rise to some disadvantages that have occured during my work with this thesis. Here are some of the problems encountered and an explanation of how this affected the choices I made when implementing the system.
Since the University have no connections to the company, getting help and answers was not easy. I actually have not receivedt answers to any of the question I e-mailed them. Finding information on how the experiments is performed and how the arrays are developed was also challenging. This made it harder to understand what information one could find in the experiments and how to collect it.
According to the end user agreement available at www.affymetrix.com, it is not allowed to use spiders, robots or scripts to contact the website or any of the products made available through the web site. Because of this manual operation by the user is necessary to keep my program up to date. Without such a statement in the contract it would be possible to make the program collect the information needed automatically.
In this section, some of the theory behind datamining from AffyMetrix GeneChips is presented, to give the reader some background information on the use of the ImageAnalysis package in the program. To perform the analysis the software was designed for, one has to structure the microarray gene expressions in a data structure. This is often a two dimentional array with the probes / genes along one axis and the samples / arrays along the other axis. In the software implementation the gene expressions from each sample is stored in a NodeObject class and not in such a data structure, but the structure is used to describe the steps in the data collection process.
As stated in a previous section AffyMetrix GeneChip technologies makes
use of a Perfect Match (PM) and a MisMatch (MM) pair. The PM feature
is a perfect match to the sequence of hybridisation, but the MM feature
has a single mismatch nucleotide in the middle of the strand. As a
measure of the gene expression AffyMetrix statistics present an average
difference (AD) :
where N is the number of feature pairs for a particular gene. A problem with this measure is the possibility of getting negative values.
This can be corrected by changing the formula to :
Where w
is a weight assigned to each pair of features.
To prevent the negative values one could assign a weight of one to
pairs where PM > MM and a weight of 0 to pairs where PM < MM. This
is done to eliminate probes with a greater hybridization to the mismatch
than the perfect match. The statistics behind collecting data from
the AffyMetrix GeneChips is much more complex than this short explanation
can present, and the interested reader may refer to the SADD - Statistical
Algorithm Description Document [2].
Figure 3.4 is aimed to give a better understanding of how a GeneChip array is designed. A GeneChip probe array consists of a number of probe cells. Probes are tiled in probe pairs as a Perfect Match (PM) and a Mismatch (MM). The sequence for PM and MM are the same, except for a change to the Watson-Crick complement in the middle of the MM probe sequence. A probe set consists of a series of probe pairs and represents an expressed transcript. |
Determining the evolutionary lineage of organisms is a major problem in the study of evolution. In the past, phylogeny was inferred by examining morphological characters. Now phylogenies are built based on the assumption that sequences from different species have descended from an ancestral gene in an ancestral species. The divergence between two genes can therefore be explained as the result of speciation event and the genes are essentially the same in all the species examined, they are orthologs. Gene duplication is another form of evolution, and consists of genes which have diverged from a common ancestor by a duplication event. These genes are called paralog, and are different genes in the same organism.
Phylogeny is a set of methods used to estimate the time of divergence between two organisms or to reconstruct the evolutionary relationship between species. This is done by using a set of methods, rules and asumptions, like for instance the molecular clock theory. To visualize the evolutionary relationship between the species it is most useful to construct a phylogenetic tree. Phylogeny is in other words about tracing evolution backwards by aligning different characters from different present living organisms.
A phylogenetic tree is a tree showing the evolutionary relationships between various species or other entities that are believed to have a common ancestor. In a phylogenetic tree, each node with it's descendants represent the most common ancestor of the descendants, and the edge lengths correspond to time estimates of when the descendants diverged from each other. Many different methods are used to build trees, the ones used in this thesis, are described below.
Phylogeny can be approched in many different ways. In this thesis, I take advantage of the constantly growing supply of completely sequenced genomes and the common understanding of some phylogenetic relations that follows from this sequencing. This means that the methods used in the thesis does not build a phylogenetic tree, but calculate changes in expression and sequence along the edges of an already well established and given species tree.
As the sequencing of DNA has become fast end easy the sequence databases grow. These databases give an opportunity to study the relationship between the sequences in an evolutionary context. Research in this area will give insight into the processes of evolution. Molecular evolution is the study of the evolution of macromolecules. Since the information regarding the development of an organism is stored in the DNA molecules, the mechanism for evolution is substitutions in the DNA. DNA is inherited from the parent organism mostly unchanged, but mutations happens over time, at a fairly constant rate, leading to the molecular clock theory. There are an enourmous amount of possible mutations but most of them are almost never seen in any population since natural selection filters them away because of the different rates of reproduction and by choosing the advantageous genes. For this reason molecular evolution is dominated by silent or neutral mutations. Several different substitution models can be used to calculate the evolutionary divergence.
A simple model for the evolutionary divergence between two sequences
is to measure the proportion of different nucleotides in the aligned
sequences. This can be estimated by:
The figure and explanations described in this chapter are all collected from the website of Andy Vierstraete [32] with cortesy of the author.
Phylogenetic trees are of two different types; gene trees and species trees. A tree which is built based on sequence data is called a gene tree since it is a representation of the evolutionary history of genes. A tree illustrating the the evolutionary history of organisms is called a species tree.
Phylogenetic methods differ both in the type of input and in the actual tree building method. The input data falls into two categories: discrete characters and continous states such as distances or in the case of this thesis gene expression profiles. Discrete characters are for example nucleotide sequences. Continous characters can be measured in different ways, the most common is either to divide them into intervals and make discrete characters represent each region or to meassure them directly such as in microarray experiments.
The two most common criterias to divide tree building methods by is the type of data they can be used on, as presented above, and the criterias or algorithms they build upon. Treebuilding methods often build upon either a clustering algorithm or an algorithm that maximises an optimality criteria.
The details of clustering algorithms can vary but the basic principle is always to group together the closest groups by some kind of distance meassure [4]. The basic principle of optimality criteria is to find the order to group items that make the least amount of change.
As seen in Figure 4.2 there are several techniques used for building phyologenetic trees, most of them produce bifurcated trees since bifurcating trees are what makes most sense in an evolutionary context. Figure 4.2 divides the building methods into three groups based on two criterias, type of data and tree building method. The three main approaches are (i) distance based methods, which work from paiwise distances between sequences e.g. PGMA and Neighbour-Joining, (ii) Minimum Evolution methods which works on distances and (iii) sequence based methods like maximum parsimony and maximum likelihood, which work directly on the multiple alignment of sequences.
Early, phylogenetic inference were made based on the morphology of the organisms. This is what is known as the Hennig's method [13] for inferring phylogenetic tree.
Maximum parsimony uses the following criteria :
``One should not increase, beyond what is necessary, the number of entities required to explain anything''This principle is called Occam's razor. The central idea of maximum parsimony is that the preferred evolutionary tree is the one which requires fewest evolutionary changes to explain the differences between the sequences under study. This means that sequences are assigned to the ancestral nodes in such a way that the total amount of substitutions is minimized. This can be done in unweighted parsimony, by assigning a cost of 1 to each substitution and a cost of 0 to each pair of identical nucleotides, or in weighted parsimony, by giving different weights to different types of substitution. A more detailed description of parsimony approaches can be found in [22].
Maximum Likelihood is a probabilistic method which assigns a likelihood
to a given tree and therefore allow the selection of the tree which
is the most likely given the observed data/sequences. The probability
for a residue a to change to b in time t along a branch of a tree
is given by :
In this thesis I have used a Minimum Evolution algorithm to calculate the gene expression changes over a given tree. This method is described in more detail in section 4.4. In addition the basic principles of the method and algorithm used in this thesis can be found in Section 6.2.
There are several different distance based methods, I will introduce
three of them in this section. Both PGMA and Neighbour Joining deserves
to be mentioned in the context of phylogenetic reconstruction methods.
PGMA - PairGroup Method using Arithmetic mean is a simple distance
based method that comes in two versions Unweigthed PGMA and Weighted
PGMA. The basic principle for the method is to first construct a node
for each sequence and then to combine the two closest sequences in
a new node and calculate the distances from the new node to the remaining
sequences. The grouping is then repeated until all the nodes are collected
in a graph. The two versions calculate the distances between the nodes
different, both assuming the molecular clock theory.
Although theese methods are for building trees they can be used for calculating differences over a given tree as well, with just minor or no changes in the algorithm.
As mentioned earlier, data from microarray experiments are continuous characters and it is not possible to apply the same methods as used on discrete characters. Parsimony methods applied on continous characters can easily be turned into a Minimum Evolution approach.
By means of the minimum evolution method [22], branch lengths
are fitted to a tree acording to an unweigthed least squares criterion,
but the optimality criterion to evaluate and compare trees is to minimize
the sum of all the branch lengths L
One such heuristic method uses a Neighbour Joining tree to guide the selection of candidate minimum evolution trees [26]. This is done by constructing a neighbour joining (NJ) tree and then search for the tree with the minimum value of L by examining all trees that have a closely related topology to the NJ-tree. After isolating one tree as a candidate minimum evolution tree (cME) statistical tests are applied to the cME-tree to identify some topology that have a smaller value for L than the cME-tree does, amongst the closely related topologies to cME. This is done until no tree with a smaller value for L is found. The final tree is used as the Minimum Evolution tree (ME). The idea behind doing it like this is that for a small number of OTU's the NJ and ME topologies are almost identical, and therefore one can use the NJ topology as a starting point when the number of OTU's is large. (Operational Taxonomy Unit (OTU) is se section 4.2).
The tree format used in this thesis is the Schreiber format. This format was chosen because of the widespread use of it in the Liberles group. However, it is not a widely used format. So I have included a schreiber/newick treeparser with the code. The parser is developed by Gisle Sælensminde and briefly explained in section 6.4.1.
A more widely used format for phylogenetic trees is newick.
Trees as a datastructure is normally represented as a string of characters encapsulated by various kinds of brackets constructing the branches between the nodes, normally represented by a node number e.t.c. Different formats have different ways of formating the string.
The Schreiber tree format is very flexible and can be expanded by just adding tags, and can therefore be used to store information in a tree structure, a kind of a low level XML code. It makes use of square brackets and commas (``[``,'']'','','') to build the tree structure. Meaning that each external node is represented by a number and each internal node is represented by a pair of paranteses that enclose all the decendents of that internal node.
For instance in figure 4.3 the internal node [1] would be represented like [a,b] in the string representing the tree.
The tree in the figure above is represented by this string ``n:-[[a,b],[c,d],[e,f],[[1],[2]],[[3],[4]]]''. The string is read from left to right constructing parent nodes for each pair of parantheses that are closed and iteratively giving the parent nodes numbers. When all the leaf nodes are combined, parent nodes are combined in the same fashion as the leaf nodes by pairing their number (enclosed in brackets) inside another set of square parentheses. In general, one could describe the format by giving each external node a number related to the position of the node, and each internal node could be viewed like a list of children denoted [X], where X is a number and [X] is a list of numbers separated by commas (``,''). Only the childnodes of a internal node must be combined before the internal node could be constructed. Meaning that ``n:-[[a,b],[c,d],[[1],[2]],[e,f],[[3],[4]]]'' would also be a legal statement, since both internal node 1 and 2 are constructed one can construct internal node 3 berfore combining external nodes e and f. This would result in the tree shown in figure 4.4.
The newick format is a more common format for representing trees. This format uses regular parantheses and a nesting of these to represent the tree stuctures. A representation of the tree in figure 4.3 in newick format would be ``((((a,b),(c,d)),(e,f)));'' All newick formated trees must end with a semicolon (``;''). There exists an extension of the newick format annotated NHX. This extension is oftens used to represent annotated trees because of the tags introduced in the format. This is a more flexible solution than the original format, but not as flexible as the schreiber format. A more complete description of the newick format is available from Joseph Felsenstein's website.
Reconstruction of ancestral sequences is useful for identifying changes/mutations that cause a functional change of the gene or for detecting darwinian selection. By infering the ancestral nucleotide sequences, we can study how the function of a gene has changed during evolution. As mentioned in section 6.2, there are quite a range of different methods that can be used. In the same way that phylogenetic tree building methods can be divided into different classes, one can also divide these methods into parsimony, likelihood, distance based and counting methods. Most methods assume a given phylogeny and use sequences of extant species on the leafs. Both nucleotides and amino acid sequences can be used as input.
Parsimony methods are able to identify the branch of change when the correct topology is known and the divergence between the sequences is small. This is commonly used in creating phylogenetic trees. Fitch [10] desribes a method for infering ancestral sequences based on maximum parismony formalism. The method uses a more heuristic approach than the brute force method of examining all tree topologies. The method assumes a given tree topology, a set of orthologs and an independent distribution of changes in the nucleotide sequence.
Because of the limitations in dealing with more than one substitution per site the parsimony approach is not the most commonly used method to infer ancestral sequences. Parsimony methods give good results when the sequences are closely related, but when the sequences diverge to much it relies on approximation algorithms to select one unique and most likely ancestral sequence, together with the limitation of not considering branch lenghts, this makes the parsimony method weak in regard to bayesian models.
There are several bayesian approaches to inferring ancestral sequence states for a given phylogeny. Two of the most common known are Koshi and Goldstein [20] and Yang Et.al. [38].
Koshi and Goldstein [20] have developed a method that uses maximum likelihood to reconstruct the ancestral sequences for a given phylogeny. The approach not only calculates the most probable sequence but also calculates the probability of any sequence character in any node in the evolutionary tree. The main idea behind the method is to use structure dependent mutation matrices to include information found in secondary structure to make the reconstruction more certain. Because the approach is based on maximum likelihood estimate of the topology and on the branch lengths, it can be a bit uncertain when the topology is unclear.
Yang Et.al. [38,39] have developed a bayesian approach for inferring ancestral sequences where the tree topology is assumed to be known. The method calculates the branch lengths by means of maximum likelihood with various models of substitution. This is as stated above a likelihood approach that uses branch lengths and substitution patterns for the reconstruction of ancestral sequence states. The method is implemented in the PAML package and is known as the BaseML program, see section 6.3.2 for a brief introduction to the program.
Yang Et.al. [38,39] is based on standard statistical theory and can be summarized as follows :
``Given the data at the site, the conditional probabilities of different reconstructions can be compared and the reconstruction having the highest conditional probability is the best estimate at the site.''The method assumes that the data consists of aligned sequences of extant species that are linked together by means of a phylogenetic tree. There is no need to asume a molecular clock and the branch lengths are calculated as the average sum of sequence substitutions per site. For a more detailed desription of the method, please refer to Yang Et.al..
This chapter is aimed at giving the reader a brief introduction to some of the technologies and program packages that are used in this thesis. The chapter also gives an overview of the system. Including a description of how it was designed and coded and details of the software for it's installation.
The process of developing software can be described using different models. I have choosen to use a component - based model, but other models, like the waterfall model and the evolutionary model could also have been used. The different models have different strengths and limitations, that one needs to consider when developing a system/program. The waterfall model makes it hard to incorporate changes at a late stage in the development, the evolution model makes use of prototyping and could involve lots of unneccesarry work. The idea behind the componenet model is to make use of existing software components and integrate them to build the required functionality. A schematic view of the component based model is presented in figure 5.1.
The component based model consists of the following steps :
The requirement specification includes conversations with the users and the making of a document that specifies the user's needs and what the system should do. Here the input and output of the program should be described. This document is made in close collaboration with the users of the system. Typical properties that should be discussed in this document are the reliability, safety and performance of the system.
After specifying what the system should do, and what kind of data it should take as input a search for already made component and/or software libraries is performed. The result of this search is analysed to establish to what extent different components could solve the problem or a part of the problem the program is being developed to solve.
After collecting and analysing different components one sometimes has to modify the requirement specification. This means finding alternative ways of acheving the same goals as presented in the original specification developed in step 1.
With the modified specification one can start describing the system. The goal of system design is to place the different components in relation to each other and to specify where there is a need for implementing new code to solve problems that no code has previously been developed for and/or to connect components together.
This part of the model takes care of the actual development of the system. Here the components are installed and new code is implemented to connect the components of the system as specified in earlier steps.
The goal of testing and validation is to determine weither the system does what is earlier specified and to assess weither it has reasonable performance.
A detailed description of Java lies beyond the scope of this thesis. Readers not familiar with Java are refered to [30,12] a basic introduction to Java. However, some of the basic properties of the program have been determined by the choice of Java as a programming language, so an explanation of some of the features of Java is appropriate.
Java is ; simple, objectoriented, distributed, interpreted, robust, secure, portable and dynamic programming framework. These are all properties that are crucial to the performance of an up to date object oriented programming language.
The programming language had to fullfil a number of criteria :
Java is an Object Oriented Programming Language. Meaning that Java enforces the object oriented programming style (OOP). OOP is a way of thinking when solving problems that describes the program as a collection of objects that solve small parts of a problem and sends messages between each other. Object oriented programming is also a way of structuring programs such that they become easy to read and modify. By the use of classes, objects and interfaces one promotes information hiding and make each part of the program possible to develop independent of each other. An object can be thought of as a complex datastructure that includes data fields with methods attched to them. By manipulating classes of objects, that have their own data fields and methods, and using libraries with existing and reusable software components OOP becomes a very flexible and well structred programming technique that is commonly used.
The program is developed on the Java API 1.4.2 release and an installation of this version (1.4.2) or higher must be present on the machine that the program will run on, to ensure that all functionality is available. Java is developed so that later releases (currently 1.5.0) are backwards compatible, i.e. release 1.5.0 includes everything from version 1.4.2 plus some additions and changes. Most functions used in the program have their origin in earlier versions of the Java API and would with a high degree of probability work with older versions of the API, but the program is only guaranteed to work with Java 1.4.2.
Netbeans is a pure Java coded IDE that was used to write the code. Netbeans is free, well documented and under constant development. Versions 3.5, 3.6.2 and 4.0 beta have been used during the coding. Since NetBeans is coded in Java it is possible to use it both under Linux and Windows, the accessibility that this property gives the product was highly appreciated and valued. Netbeans documentation and other information can be found on http://www.netbeans.org and like the program itself it is freely available.
com.Ostermuller is developed by Stephen Ostermiller and is a Java library that contains classes to solve miscellaneous problems such as file parsing. From the library I've used classes to parse .csv files (comma separated values). These are files that hold one data entry per line with each attribute separated by a comma. These classes made parsing csv files easier and more accurate than if only core Java classes had been used.
Ostermuller.com is a relatively large Java library, that contains a lot of useful methods. Only the classes designed to parse .csv files (comma separated values) are included in my program. The other files and classes contained in this library are excluded to save space and limit the size of the program. The entire library can be found on http://ostermiller.org/utils/. The library is free software, and can be redistributed and/or modified under the terms of the GNU General Public License.
In the implementation of MolMine's JExpress program Bjarte Dysvik has implemented classes for reading data from AffyMetrix Genechips. The code is based on methods and theories from the Statistical Algorithms Description Document found on http://www.affymetrix.com. The basic functions of the code is to read data from AffyMetrix Genechip .cel files and .cdf files using the statistics and normalisation procedures described in the SADD document. The data that are read out of these files are filled into a dataarrays and represent the gene expression levels of the genes in a specific leaf node.
Mirror, mirror on the wall, which is the best database of them all? The rapid growth of databases containing biological information makes it difficult to find the right one for each specific project. To help guide the user trough the jungle of databases Galperin M.Y. has developed a collection of molecular biology databases. The collection [11] is a starting point for searching the web for reliable information, and all the databases in the collection are free to the public. The database list uses a hierarchical classification and is available at http://nar.oupjournals.org, the online version contains more than one listing of some databases, due to the classification system. This means that the online version is a more complete listing and should be preferred instead of the printed version. EnsEMBL is a bioinformatics project to organize biological information from sequences from large genomes, and therefore very suitable for this kind of project [3].
When looking for a database for this project there where several candidates to choose from. EPD [27]- Eukaryotic Promoter Database is a small database with about 5000 entries. It consists of eukaryotic promoters which are experimentally defined by their transcription start site. It soon became clear that this database did not cover the data needed for this thesis and the database was rejected. By rejecting the EPD database the search for promoter regions soon got turned into a search for upstream regions instead. Both the UCSC genome browser system [16] and the NCBI genome resources [35], displays information on a genome basis as well as Ensembl [3]. By changing the specification to use upstram regions instead of annotated promoters both [3,35,16] became useful for the project. The reason EnsEmbl was choosen over the other two candidates was because of the quality of the user interface and the ease with which genes can be compared and homologs can be collected from different genomes.
Ensembl is freely accessible and consists of several levels from which data can be collected. Ensembl can be accessed via an html interface, a java API, a batch query tool (EnsMart) and trough different tools that can be locally installed on your computer. The data can be retrived as flat files, FASTA formated files or raw dumps. All these different ways of accessing Ensembl make it extremely flexible and easy to work with. In this project the user can access Ensembl through EnsMart. But Ensembl and PHYREX could be more closely connected (See Section 9.2).
EnsMart is an EnsEmbl tool that allows the user to run batch queries against the EnsEmbl database. It is structured to allow genome queries and therefore a very useful tool in this project. This section describes how the user should use EnsMart to retrieve the data from the EnsEmbl databases in a manner that suits this program. There are two different data files that need to be downloaded from the EnsEmbl database.
First, the user has to collect a file containing all the sequences of the upstream regions of the genes in study. This can be done by following the recipe described in Algorithm 13.
After collecting sequence files for all the species the user must collect a file containing ``maps of homologs''. This is a mapping of EnsEmbl id's of the genes from each species that are homolog with each others. This can be done by following the recipe described in Algorithm 14.
A more thorough explanation of Ensmart and EnsEmbl in general can be found in [29,17,25] and of course on their website http://www.ensembl.org.
The fasta format is a specific way to format database retrievals. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.
The chapter is aimed to give the reader a overview of the system, it should provide the reader with insight into how the theory described earlier in the thesis is used in PHYREX as a system. It also provides some implementation details and package specific details needed before reading Chapter 7.
The program can as Figure 6.1 describes be divided into two parts. The left side of the figure describes the gene expression reconstruction and the right side describes the sequence reconstruction.
The purpose of PHYREX is to use data given by the user to reconstruct the ancestral states of gene expression profiles and sequence data in a given phylogenetic tree. The words reconstruct, derived or populated, all refer to the use of data from the leaf nodes in the tree to populate the internal nodes with calculated values for sequence and gene expression data.
The PHYREX module at the top of the figure collects data from the user. It collects data on two levels and both for gene expression and sequence data. The two levels it collects data from are the tree level and the node level. On the tree level it collects a tree to guide the reconstruction, a map of homolog genes in the species and some microarray specific files used to set up the microarray reading system. On the node level gene expression and sequence data are collected for the species in the study.
The microarray data are then reconstructed by the PHYREX framework by use of a minimum evolution algorithm as described on the left hand side of the figure above. The algorithm is described in detail in Section 6.2.
Sequence reconstruction is performed as described on the rightmost side of the figures by using the PHYREX framework to collect and sort the sequence data. After the collection and sorting, the system calls ClustalW to align homolog sequences from the species in the leaf nodes of the guide tree. When ClustalW has aligned the sequences the framework calls BaseML to calculate theoretical sequence information on the internal nodes of the tree. A more thorough description of both ClustalW and BaseML can be found in Section 6.3.
Once the ancestral gene expression profiles and the ancestral sequence information has been calculated the frameworks collects and structure the new information and displays it in tables. The tables describes the change between nodes and the values in the nodes for both data types.
As seen in Appendix A and Section 6.1.1 the aim of the project is to find and/or develop methods that enable the reconstruction of sequence and gene expression profiles of the ancestral states in a given phylogenetic tree. The first part of the project was to describe algorithms for constructing ancestral gene expression states. In the process of finding/developing such an algorithm several options such as parsimony, likelihood and minimum evolution methods were tried. All of the methods are used on discrete characters, but the Minimum Evolution method was found to be more appropriate for converting to continous data such as gene expression states. When considering the sequence reconstruction I choose to implement connections to ClustalW and BaseML instead of implementing such algorithms in the PHYREX code, since both programs are well thought of and there is a common understanding of their accuracy. Both ClustalW and BaseML are described later in the thesis. The implementation which is written in Java and takes as input gene expression data from microarrays, sequence data from the upstream regions of the genes on the microarrays, a tree and a list of homolog genes from the species in the experiment. All the input files are collected from either EnsEMBL or Affymetrix databases. The procedure for downloading these files from their databases is closely described in section B.3.2.3 and in the user document that can be found on the web.
As seen in figure 6.2 there are two different types of input files, some that are node related and some that are tree-related. This makes the mapping of each file more complicated than if all the files where to contain information on the same level in the process. To solve this problem I have made a GUI to map the right file to the right node in the tree and code to find homologs and build a gene and an edge array for each of the nodes containing the node specific information from the inputfiles.
Data from microarray experiments are continous characters, thus most discrete character, e.g. individual sequence positions, methods cannot be used, However, the reconstruction of continous characters can be done using a Minimum Evolution approach. A range of values is obtained by running through the phylogeny and calculating legal intervals and then narrowing them down to a minimum. Choosing a representative value for the expressions can be done in several different ways, in the thesis I have have choosen the midpoint of the minimum interval to be the representative value for each gene expression.
Regulatory elements in the upstream regions of a gene control the gene expression, by looking for branches with significant changes in gene expression and comparing them with an analysis of point mutations in upstream regions in the genes it is possible to isolate connections where there are significant changes in gene expression and many point mutations and hence identify sites with important functional roles in regulating gene expression.
Reconstruction of ancestral states of characters over a given tree can be done in several different ways. Koshi and Goldstein [20] use a maximum likelihood formalism and Yang Et.al. [38] describes a statistical method. Both methods are constructed for nucleotide or amino acid sequences but one could easily use similar methods on other characters such as gene expression profiles. The method of Yang Et.al. [38] is described in more detail in section 4.6, and is the method used in BaseML.
The reconstruction of continous characters can also be done using a Minimum Evolution approach. As seen in the algorithm below a range of values is obtained by running through the phylogeny and calculating legal intervals and then narrowing them down to a minimum and choose a representative value.
Given a tree topology, information on the leaf nodes and an evolutionary model the internal information can be constructed and the evolutionary change can be assigned to the edges. My approach is described in figure 6.4. As mentioned earlier the algorithm developed in this thesis runs through the tree two times for each position in the gene table. The first iteration is to construct intervals of allowed values. There are two categories of intervals one needs to considerate in this iteration. Each node is represented with a minimum and a maximum value. The values construct intervals that may or may not overlap. See figure 6.3. When they overlap the algorithm calculates the intersection between the intervals, but when they do not the algorithm constructs a interval X of the values in between A and B.
The figure show how the intervals are constructed by the algorithm. If the two previous intervals are overlapping the union of these are used, but if they are not overlapping the new interval is constructed by interval X that can be constructed in between A and B. |
The second iteration runs trough the nodes and tries to minimize the allowed intervals. This is done by checking the upper and lower limit of the parent node interval and the interval of the node itself. If the upper limit of the parent node is lower than the upper limit of the node itself or the lower limit of the parent node is higher than the lower limit of the node itself, then the limits of the node are changed to be the same as the parent node limits.
In the same iteration the interval is narrowed to a minimum interval of the node the algorithm also select a representative value of the gene, meaning a theoretically calculated gene expression profile for the gene. This is done by applying the rules described in Algorithm 4.
These calculations are done on a gene expression level, so they are executed several times for each node. A better description of how data is divided and calculated is given in Section 7.1.7.2.
Within the field of Computational Biology/Bioinformatics, there exists a large collection of commonly accepted algorithms and methods used to solve different tasks and problems. More and more of theses algorithms are being implemented by people around the world and published and/or made freely available on the world wide web. A quick search on google or other search engines on problems such as sequence alignments or evolutionary reconstruction results in long lists of programs that can solve these problems. PHYREX makes use of some externally developed program packages. This section introduces these packages and describes where to find documentation. The searching and reuse of public available program packages described in this and the following sections is what makes up step 3 in the Component based model described in section 5.1 and figure 5.1.
ClustalW [14] is implemented at the EBI as both a website and a standalone application freely available for download from the EBI website (http://www.ebi.ac.uk/clustalw/). In phyrex a locally installed version of the program is used. The standalone version of ClustalW can run both as a interactive program with textbased menus and as a command line program with arguments. In Phyrex, I take advantage of the last option and run it from the command line using the same built-in methods to invoke ClustalW as I use for running BaseML. The implementation of ClustalW-specific code in Phyrex has many of the same properties as presented in the PAML section (6.3.2). A code view of the implementation is therefore not presented here. There are, on the other hand, some differences and ClustalW specific tricks that had to be implemented. Both ClustalW and BaseML take input from and write output to files. Unfortunately, they does not use the same format for their sequence files. Therefore, I had to implement methods for changing the sequence format from ClustalW to BaseML formats. This involved yet another read and write procedure which may seem unnecessary at first, but is needed to convert the different formats.
The aim of this chapter is to document the Phyrex system and not the external program packages that Phyrex uses. Therefore, I begin this section by introducing the reader to the documentation of PAML made by the author [39] and an article where BaseML is introduced [38].
Phyrex uses both gene expression data and sequence data. The gene expression data is entirely dealt with by Phyrex algorithms but the sequence reconstruction is done with BaseML. Core BaseML uses as the name indicates a Maximum Likelihood approach to reconstruct the ancestral sequences of nucleotide bases. A more thorough explanation of the principles behind the reconstruction can be found in [38].
When describing the use of BaseML in Phyrex the most appropriate place to start is with the implementation. BaseML is called from Phyrex as a command line argument through built in Java classes like Runtime and Process. This makes the program run in the background, independant from the Phyrex process. Because of the clear sequential structure of the calculations, Phyrex must wait for BaseML to finish before it can execute the next command.
BaseML uses a control file to tune its parameters before it calculates the sequences. This control file is a text file with one parameter and one value on each line. In phyrex, these parameters can be tuned before running the calculation. When clicking the PAML button in Phyrex one opens the panel shown in figure 6.5. For a full description of the parameters the reader is refered to the PAML documentation. In this panel one can adjust the methods and output rate BaseML should use in the calculation.
BaseML is a program for calculating ancestral sequences in a given phylogenetic tree. It uses two types of parameters, a phylogenetic tree and sequences at the leaf nodes. Since it does not calculate any guide tree or other type of alignments the input sequences have to be aligned before entering BaseML, in Phyrex this is done by ClustalW which is presented in section 6.3.1.
Because of the differences in tree formats in the Phyrex program and the tree formats used by BaseML and ClustalW the program package comes with a treeparser included. The treeparser is developed by Gisle Sælensminde and is distibuted as a Linux binary file together with a script file used to invoke the process. The parser is developed in CLisp, resources can be found on http://www.cons.org/cmucl, but distributed as a package including both the source code, the CLisp interpretator and the framework needed for it to run on any linux based system. This make it easier to use since one does not need to have any CLisp dependant libraries installed. But it also makes the distribution file quite large (19 mb).
The treeparser works as an standalone command line application as well as a part of the Phyrex program.
The treeparser has minimal documentation but here are some guidelines for using it as a standalone application if that is needed. The parser relies as mentioned on two files. From the figure below you see the script file treeconvert and the core file treeconv.core which includes all the libraries and source code.
Figure 6.6 illustrates a theoretical call to the treeparser from the command line. It consists of a call to the script file (treeconvert) with three arguments. The first argument (treetoconvert) is a string describing where the program can find the file containing the tree to convert from. The second argument is a filename for the output file and the third argument is the output directory where the output file is to be saved. Some constraints on the arguments exists. The first argument must contain a valid file with extensions and the path to the file. The second argument must consists of a filename with extentions and the third argument must consist of a . (if the file is to be placed in the current directory) or a valid path to the directory it should be placed in.
The treeparser does not need any information on which format it should parse to or from, it collects information on the format from the file given in the first argument and parses the file to the opposite format. It works as a converter both to and from the schreiber and newick formats.
To print the schreiber format trees on the screen I have used the tree viewer that Saywan Pasha developed during his master thesis. This treeviewer program has been modified to incorporate some other functionality that are needed for this project. These changes have been presented to the author and he has accepted them. I will describe these changes here and refer the reader to the Cand. Scient. thesis [24] for a full description of the entire treeviewer package.
Treeviewer was, as the name suggested made for displaying phylogenetic trees and related information. The original logic was to show a information box when clicking on a node in a tree. This is changed to display the information in another part of the program instead. This is done because I wanted to add some node specific functionality. Instead of showing an information box, treeviewer now updates the Node Info part of the GUI to present the node information from the node that is clicked and to make the mapping of input and analyzing of the output as easy and straight forward as possible.
Since treeviever originaly was developed to run as a web applet the entire package is modified to run as a part of an application instead. Some changes to the code have been made to make this possible. These changes have made the program more user friendly since they avoid the user having to use multiple windows.
To store information about the nodes, external as well as internal, the Treeviewer uses a NodeObject class. To use Treeviewer as a part of Phyrex this class has been redesigned to store the kind of information needed by Phyrex. The class has been widely expanded and now contains all the information that creates a species in Phyrex.
Our test set is data from a Science article by Enard Et.al. [8] and contains two brain and two liver samples from one Orangutang, three Humans and three Chimpanzes. The samples are collected on the AffyMetrix GeneChip HG_U95 and HG_U95A_v2 microarrays which respectively are designed for human brain and human liver. Because of the close relation of these species, the samples are belived to cross hybridize in such a fashion that it will not affect the result to any degree.
A problem with this test set is that there is no genome sequencing project for the Orangutang, and therefore, there is a lack of sequence data. For the purpose of testing the method, these sequences are collected for Mus Musculus and used as an outgroup in the sequence reconstruction and in the comparison with gene expression. The method is developed for use on whole genomes and will be more usable when more genomes are sequenced.
Another problem with this test set is that, even if the species are closely related and are supposed to cross hybridize with each other some normalization should be done between the species to make a direct comparison of the gene expression values more accurate. This was not done by Enard Et.al.
PHYREX is a program for reconstructing ancestral states in a phylogenetic tree based on minimum evolution algorithms for reconstructing gene expression profiles in the ancestral nodes of the tree. The program is developed as a standalone application on the Linux platform. It gets it parameters from five different file types that are all parsed and computed in different ways. PHYREX is developed in Java and uses components developed in C, C++ and CLisp. The purpose of this chapter is to give the reader an introduction to the program development.
Software for reconstructing phylogenies already exists but is based on other types of data, E.g. MEGA2 and PAML calculate the trees based on sequences and/or distances. Most of the software available calculates the trees as well as populating them. The big difference between these programs and PHYREX is that PHYREX combines gene expression reconstruction and sequence reconstruction together and the comparison of the results found by the two types analysis.
The system input files have different formats, some of them are species dependent and exists in many copies, others follow the analysis and exists in only one copy. The main provider of these files are the Affymetrix experiment and the EnsEmbl database.
From the AffyMetrix experiment, the following files must be collected:
The <name>.cdf file holds information about the structure of the genechip used in the experiment. It maps the perfect match and mismatch probes against each other to construct a probe pair and specifies the ``algorithms'' used to construct the probe sets. The .cdf file is formated in a special tab separated format developed by AffyMetrix and needs a specially developed parser to be read properly. For each experiment to be analysed in PHYREX one .cdf file is needed. The file uses a name convention that specifies the GeneChip it is developed for. In the test set used in this thesis the file is called HG_U95A.cdf
The .cel - files holds the values for each probe on the GeneChip. The values are represented with a number, positive or negative, describing the amount of target DNA that did bind to the probe. To run a PHYREX analysis you need one .cel file per species in the analysis (one per leaf node in the tree). The file is tab separated following a similar but not identical format as used in the .cdf file and needs a specially developed parser.
The annotation file either follows the AffyMetrix experiment or can be downloaded from the AffyMetrix wesite (http://www.affymetrix.com). It contains a list of all the id's used on a GeneChip and maps them to other id's in public available databases such as EnsEmbl. The file is formated as comma separated values and follows a similar naming convention as the .cdf file (testset : HG_U95A_annot.csv) This file can be obtained in other formats as well, but support for these formats are not developed in PHYREX.
From EnsEmbl the program needs the following files to perform the analysis:
One needs to collect two file types from EnsEmbl, as described in section 5.3.2 . These files must be collected for each species in the analysis. The sequence file is fasta formated. See section B.3.2.1.
The other file type that needs to be collected from EnsEmbl is a file containing a map of homologs from the species in the analysis. For each analysis in PHYREX one such file needs to be collected. The file must be collected as a list of comma separated values and not any of the other available formats. The file uses the same file parser as the GeneChip annotation file.
Other files needed are:
The treefile must be in Schreiber format and should contain a phylogeny describing the species involved in the analysis. The best performance of the program is obtained if the tree is described with both a dependency line and a label line. The labels should reflect the name of the species on the leafs. An example of such a file is shown below.
Since PHYREX is coded in Java, the GUI could have been implemented either as an applett or as an application. The biggest limitation with developing the program as an applets, is that applets do not allow access to files and directories on the users computer, therefore PHYREX was implemented as a standalone application. More ``recent'' technologies like Web services and application servers could have been concidered as alternatives, but from the specification of the thesis this is considered not recomendable. The next section summarises some of the design specific choices and some information about modules, components and data structures used in the program.
The filesystem gives a brief overview of the PHYREX packages and modules. The system is quite simple and is constructed such that the different external programs or libraries have their own folder and the core that interact them is placed in the rrevolution folder. The evolutionfolder contains all the classes and externally developed packages used by the program. It is the PHYREX folder that is zipped and distributed on the website.
Many different data structures have been used in PHYREX. This section summarises the properties of some of the data structures used. The structures presented are the ones that give an advantage when used and/or are the more advanced / complex structures.
Even if the PHYREX framework classes do not use these data structures directly they are used by the treeviewer package which is incorporated in the program. Since some of the features are used in the PHYREX clases and the NodeObject class in the treeviewer package is rewritten in this theis a presentation of the basic tree and node classes found in Java is apropriate. The tree datastructure is represented by the root which is an instance of the DefaultMutableTreeNode class. All the nodes are instances of the same class.
DefaultMutableTreeNode is a general-purpose node in a tree data structure. A tree node may have at most one parent and 0 or more children. DefaultMutableTreeNode provides operations for examining and modifying a node's parent and children and also operations for examining the tree that the node is a part of. A node's tree is the set of all nodes that can be reached by starting at the node and following all the possible links to parents and children. A node with no parent is the root of its tree; a node with no children is a leaf. The class provides methods for traversing the tree and holding references to user objects (the getUserObject method). By using the getUserObject method it is possible to construc an entire class to store the properties of a node in. This is done in the thesis by the NodeObject class.
A Hash-table is a set of key : value pairs where the key is used to put or get the value to or from the right bucket. Any non-null object can be used as a key, as long as the class constructing the object implements the hashCode method and the equals method found in the class Hashtable.
An instance of Hashtable has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. In the case of a collision, a single bucket stores all the entries, which must be searched sequentially. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hashtable exceeds the product of the load factor and the current capacity, the capacity is increased by calling the rehash method. The default load factor is generally set to 75%.
As you can see from the algorithm above one can put and retrive values
in an hashtable using the same key. This makes it an invaluable datastructure
when having unique identificators on the values used. In the PHYREX
program package hash-tables is used in the mapping of sequences and
id's to the nodes in the tree. Without the use of hash tables in these
operations the program would be using at least an order of O(n
)
to map all the different values in the tree. Instead it is now running
in a linear time. This section draws from the Java 1.4.2 API [15].
PHYREX is made up by 10 Java classes (here called modules), four externally developed libraries (called packages), stored in the folders com, imageanalysis and treeviewer. Treeviewer is in fact a program but due to the tight interaction with PHYREX it is considered as a package. The Bnb folder contains externally developed programs that PHYREX uses (Treeparser, ClustalW and BaseML). See sections 6.3.1,6.3.2 and 6.4.1 for detailed information about these packages. The rest of this section describes the modules found in the rrevolution folder.
Figure 7.5 is aimed to show the most important interactions between the different modules and packages PHYREX is made up of. The modules is displayed as rectangular objects and the packages is displayed as folders. The interactions, described as lines, show wich modules or packages that communicate with each other. The main modules in PHYREX is the Rrevolution, DataMatrix and Bnb module. From these three modules most of the computation is arranged and performed.
The Bnb module handles almost all the communication between the PHYREX system and the externally developed programs. It has methods for reading input streams and writing to output streams from ClustalW, BaseML and the Treeparser. The module makes use of the Runtime and Process classes from the Java API. The Runtime class provides a framework for Java to talk to the environment in which the application is running. Every Java application has a single instance of class Runtime. The current runtime can be obtained from the getRuntime method. By using the Runtime.exec method a native process is constructed and returned as an instance of the Process class. The process class can be used to control the process and send or retrieve information from the native process trough a java class. This is done by methods such as getOutputStream(), getInputStream(), kill() and waitFor().
Some native platforms only provide a limited buffer size for standard input and output streams, by not handling the input stream or reading the output stream of the subprocess, the program may cause the subprocess to block or even deadlock the entire system.
DataMatrix is the module in PHYREX that takes care of most of the calculations. It is by far the biggest module of the system and has methods for populating the NodeObjects, calculating gene expression profiles and file parsing. The module makes use of both the datastructures presented in section 7.1.3.2 and as you can see from figure 7.5 it handles the input from both the ImageAnalysis package and the Bnb module. Since the Bnb module only is used for reading and writing to BaseML and ClustalW all the logic of analysing and parsing this information is done in the DataMatrix module. The module makes extensive use of the Tree datastructure and file reading and writing.
XMLparser is the module that implements the save and load functions of the system. As the name inplies the saving is done in an XML like fashion. The module uses a range of classes to split, read and write the data to and from XML formated files.
Two different set of classes have been developed for interacting with XML files. These are called ``sax'' and ``dom''. In this module, I have used the ``sax'' set since the files to be read are quite large and the sax method enables the program to read the files bit by bit. The ``dom'' framework needs to load the entire XML file into the memory and is not appropriate for large files.
Rrevolution is the main module in the program, it takes care of the graphical user interface (GUI) and as you can see from figure 7.5 it invokes many of the other modules. The Rrevolution module also handles all the user interactivity. To perform these task Rrevolution uses a lot of classes and methods from the javax.swing libraries such as JFileChooser, JButton, JLabel etc. It also takes care of some logic and serves as the module that binds the system together.
The DBLocal module is used to filter annotation files from AffyMetrix. It concist of only one method used to read a comma separated file, filter and store the information needed by PHYREX in a local ``database'' - file. This file is then again used in the sequence mapping to assign the right sequences to the right nodes in the tree.
The GeneFilter module is developed to filter out the data that should be analysed. The module can run own as a standalone program that takes a list of parameters, but its main usage is inside PHYREX were it gets its parameters from the AnalysisFrame module. The GeneFilter module produces files that contains candidates for searching TESS. The output from this module will be more explained in Chapter 8.
DataMatrixFrame, EdgeFrame, AnalysisFrame and PamlFrame are modules used to present information in the GUI. They each use components and methods from the javax.swing library to construct a GUI element to present data in.
DataMatrixFrame and EdgeArrayFrame are used to present information found in the NodeObject of the node clicked in the tree. They use a JFrame extension as a container where a JTable is added and populated with information from the data or edge array in the NodeObject class.
AnalysisFrame and PamlFrame differs slightly from the two other modules. AnalysisFrame and PamlFrame is used to display information and collect user input to tune the analysis and the performance of BaseML. It uses JLabels, JTextAreas and JButtons to display the information. In addition to having a framework for displaying this information AnalysisFrame and PamlFrame modules also has methods for reading and writing these values to and from the system memory or control files.
Even if the software development is well structured, some problems are bound to occur. In this case, the problems that occured had to do with the input of data to the program. In this section I will describe some of the problems that occured and how I choose to solve them.
The input data to the program is a set of files. Each file is formated
in a special format and needs different parsers to be read. In addition,
the program needs to collect information from a file based upon a
combination of data from other files. This makes the data retrieving
job much harder. A brute force approach to this problem would be very
time consuming for exapmle, for each line in one file the program
would have to read trough all the files in another set of files in
order to match the correct values. This could easily lead to an O(n
),
O(n
) or worse time complexity trying to solve this.
My solution to this problem is using Hash - tables, as described in
section 7.1.3.2.
The solution of this problem could best be described with an example. In algorithm 8 a listing of the code used to find the ensembl id's is described.
The ensembl ID mapping consists of two files. The first file is the annotation file from Affymetrix and the second file is the homolog map collected from EnsEmbl. In the annotation file a list of GeneChip IDs mapping to IDs contained in major databases, including EnsEmbl is described. Since the GeneChip is constructed for only one species (e.g. human in the test set) one needs to map the human EnsEmbl ID to the homolog gene in the other species under study. This can be done using the file containing homologs collected from EnsEmbl. This is an example of how to find one ID in the annotation forces the program to read through the entire homolog file to find a correlating id, this is repeated for each id and can be very time consuming. The code above describes a solution in which one reads one time troug each of the two files. This is done by first reading through the annotation file and storing the information in two hash-tables. one with the affy ID as key and the ensembl ID found in the annotation file as value and one with the affy ID as value and the ensembl ID as key. This is done in the method makeHash() in the above algorithm (8). The second file is read in the expandHash() method and takes care of the mapping of ensembl IDs. It reads the file containing the homolog IDs and expands the hash table containing the affy ID key and the list of ensembl IDs. To populate the nodes with ensembl IDs, the affy ID can be used as a key and find the ensembl ID that correlates with the species by searching trough the list found in the hashtable. Similar solutions are used to collect the sequence information from the sequence files.
The input files contain information that should be linked to different modules and / or elements in the program. The mapping of the right file to the right ``place'' in the system becomes a problem. To solve this a GUI was needed, see figure 7.1. The GUI consists of three main parts. The command panel on the right hand side, the tree panel and the node panel on the left hand side. The command panel contains buttons for loading the files that are linked to the analysis, i.e. the files that only have one copy per analysis. After loading a tree, the tree panel gets activated. By clicking on a node in the tree the node panel displays information about that node. The node panel, contains GUI elemensts for loading the information which should be linked to each node or species. There are multiple copies of this type of files in each analysis, e.g. the sequence and .cel files.
This section presents the implementation techniques, and code views of the different modules. The purpose is to make the understanding of the code easier to the reader or to users who want to expand the code with new functionality e.t.c. The first four modules is presented in more detail than the last three, because these are the modules that deal with the problem solving. The last three modules are used to present gui's and collect input.
Rrevolution is the main class of the program. It is the object of this class that starts and runs the program. The class can be divided into four main parts. The main method, the construction of the GUI, the data collection and the invoking of the processes and other classes.
Most of the code in this class is devoted to constructing the GUI. The GUI code is located in different parts of the class because this is the way the NetBeans IDE generates GUI-code. It consists of a declaration part and an initialisation method which is called from the constructor.
The data collection is handled by the section of code located between the GUI initialisation method and the GUI declaration. It consists of a range of methods used to handle the action requests sent when the user clicks a button in the GUI. These methods take the user input and store it or send it to the right objects for processing.
The running of the program is done by the main method and the constructor. The main method creates an instance of the class, resizes it, gives it a look and feel property and displays the program.
The logic defined in the class is found in the methods run() and render(). The run method invokes the treeviewer package and gives it the information needed to construct a tree. When receiving information back from the treeviewer package it takes care of displaying the tree in the tree panel. The render method is used to perform the calculations over the tree. It calls the DataMatrix class and invokes the calculation methods sequentially, after each calculation it updates the GUI.
The DataMatrix class consists of a collection of methods used to perform the different calculations needed in the analysis. The class, functionality can be divided into three main tasks: parsing files, calculating the nodeobjects and updating them.
Many of the methods in this class are dedicated to parsing different files or calling methods in e.g. the ImageAnalysis package to do so. These methods use the recipe described in section 7.1.5 and maps the results to different data structures in the node objects.
The calculation of the leafs and the internal nodes are done separately. This is because of the differences in the data that needs to be calculated and because of some special features with the tree structure. The calculation of leaf nodes consists of mapping the right information to the right node, and setting up some system dependent features e.g. the min and max arrays in each nodeobject. In the internal node the calculation consists of copying IDs from the leaf nodes and calculating gene expression values and sequences.
The third task is to update the nodeobjects with the new information. This is not separated from the other two tasks in the code. Almost all the methods that calculate some new information end with updating the node object with the information needed.
The XMLParser class is a framework for saving and loading information in the system. It consists of a method that takes care of saving all the information to an XML formated file. And a range of methods to load the information again. The save method uses a FileWriter object to write the information, but structures the information by the use of tags (almost like HTML, described in section 7.2.2.1). The framework for loading information from a XML formated file is a bit more complex. To solve this problem one needs methods to process the different tags in the right way. In this program you will find methods for monitoring the load process and dividing the loading up in elements and methods that process each of the different elements, tree, node and gene.
Bnb handles the information exchange between PHYREX and the external programs; BaseML, ClustalW and TreeParser. All the external program collect their input and store their output to files. This means that the communication between them and PHYREX involves executing the programs with the right files as parameters. In Bnb both BaseML and ClustalW have a read and write function. The treeconverter only have a write function. In addition to these methods there are some methods that take care of format differences and wash the sequences, by taking away illegal characters. The write and read functions take care of the input and output streams that are established when calling an application external to the Java framework.
DBLocal contains one method used in this program. The reason for making an entire class for this one method is that future work aimed to make the program communicate with external databases is ment to be placed in this class. The method makes use of the com.Ostermiller package methods for parsing comma separated files.
GeneFilter makes use of some of the Hashtable solutions mentioned in Section 7.1.5. It can run as a commandline program with two, three or eight parameters or as a part of PHYREX. Depending on how many parameters it is given (from PHYREX or as commandline arguments) it perform different calculations. If the program is run with eight parameters it collects the candidate genes from the edges of the distribution along with a set of control genes from the middle of the distribution. If it is run with two parameters it alignes the candidate genes with their homolog genes in the other nodes in the tree. If it is run with three parameters it generates output files that can be loaded into TESS and isolate the mutation sites in the candidate sequences. The explanation of this module should be read in context of Chapter 8. GeneFilter generates analysis data for the trippletts that are specified by the user in the GUI.
As stated earlier these methods are used to present parts of the GUI. They are used to display information collected from the nodeobjects. To do this they use the JTable component from the javax.swing library. The JTable is a component that display a data array with flexible columns and rows. The JTable has many features that can be used to tune the performance and view of the data. In this program none of these features are used, because of the nature of the data and because no editing is done in the table since it is only used for shoving the results of the calculation.
PamlFrame displays the information found in the control file of BaseML and can be edited and saved. The information edited in this frame will be used in the calculation of the next analysis. The program only uses one control file, and precautions must be taken to ensure that the analysis is executed with the right parameters. A change in the control file will be valid until the user restores the default values again.
AnalysisFrame displays information and enables the user to alter the parameters used by the GeneFilter module when performing the next analysis. AnalysisFrame only stores the values added in the system memory, meaning that everytime the system restarts the analysis is set up with default values. AnalysisFrame uses, similar to PamlFrame, components from the javax.swing library to display the information in a JFrame component.
Abstraction is a fundamental principle of modelling. A system model can be created at different levels, starting at the higher levels and adding more detail as more is understood about the system. A good model displays the information that is relevant and hides details so as to not confuse the bigger picture.
The Unified Modeling Language is a modelling syntax aimed at creating models of software-based systems. UML is a language: whereas Java is a programming language, UML is a modelling language with a set of modelling elements, diagrams and rules that enables the user to specify the system under construction on different levels of details. It does not tell what diagrams to create, but supply a framework for constructing diagrams. It is designed to be user extended to fill any modelling requirement. UML works with any programming language, meaning it introduces no language dependent properties. There exists lots of software that makes drawing the diagrams more easy. In this thesis I have used Dia, which is found on the local Linux installation at the University of Bergen.
UML is well documented, there exists guides and references online and many books have been written on the subject. But despite these efforts the language is not always well understood. It is a generic language and needs to be adapted by the user to suit particular applications.
The diagram below aims to give a presentation of the modules and data that exist in the program. The modules and data can be external or internal and come from many different sources. The main objects7.1 of the program are the Tree and the Node object. A Tree and a Node object are defined by their contents in a hierarchical structure.
Starting from the top, a Tree object is defined to be a collection of Node objects that are linked together and a Node object is a collection of different datastructures including a DataArray and a EdgeArray object. The Data and Edge-array objects are defined by a collection of different data structures in the same way as the Node object. Other components used in the program handle the CDF parsing, the parsing of the map of homologs and the affy ID to ensembl ID mapping. The program also uses three externally developed modules; PAML, ClustalW and treeViewer which are described in the sections 6.3.2, 6.3.1 and 6.4.2.
As described in earlier sections PHYREX uses, stores and calculates more information and data than described here, however the data described in this section is concidered the core data of the system and building blocks of the system.
The sequence diagram above is aimed at giving the reader an overview of how the information flows in the system and in which order the different processes are invoked and calculations performed. The information flow starts with the user clicking on the tree button for loading a tree. The treefile is saved in the Rrevolution object and passed on to the treeviewer package. The treeviewer package calculates the tree, and builds a treestructure based on the treefile. The treeviewer package draws a tree and returns the root to the Rrevolution object which updates the GUI. The user can now load the rest of the parameters into the program. After loading all the parameters the user starts the calculation by clicking the calculate button in the GUI. By doing so he tells the Rrevolution object to create a DataMatrix object that is used to calculate the values in the tree. The DataMatrix object is called with different methods in a sequential order from Rrevolution, and responds by updating the NodeObject with new values. First, the gene expression values are collected with help from the ImageAnalysis package. Then, the internal nodes gene expression values are calculated by the DataMatrix object itself and the sequence values for the leaves are collected (also done by the DataMatrix object). When all sequences are collected the ancestral sequences are constructed by invoking a bnb object that communicates with ClustalW and BaseML. The values are returned and the node object and GUI is updated. The DataMatrix object then does some cleaning up and selecting before it lets Rrevolution update the GUI and present the finished calculation to the user.
Phyrex is made public available at http://www.rossnes.org (later this will be changed to http://www.rossnes.org/phyrex). The development of this webpage is considered to be a part of the thesis itself and this section documents its development.
The website provides the user with documentation, background information, source code and installation instructions. It contains presenting news and important notices to the user (Home), background information and motivation (Background), Documentation (Documentation), this document (Thesis) and a Worklog page where changes and updates are described in more detail. The user can also find contact information and a range of different downloadable files at the site.
The webpages are developed in a CSS-styled layout. It is based on a right-aligned two column design where the main part of the page is shown in the left part and the menus and other short information is presented in the right part of the page. Since I am the only maintainer of the site it has been developed as a static site with no database support. This has to be changed if more people are to be involved in the updating and development of the program and website. For now this is not nessecary and has not been implemented. If such an extension is needed one idea is to let the service group at the CBU handle this.
HTML is the standard for publishing hypertext on the World Wide Web, and can be created and processed by a wide range of tools, from simple plain text editors to sophisticated WYSIWYG tools. HTML uses tags such as <h1> and </h1> to structure text into headings, paragraphs, lists, hypertext links, etc. To publish information for global distribution, one needs a universally understood language, HTML is used as such a publishing language for the World Wide Web.
HTML enables its users to :
HTML 4 is the newest revision of the standard, it extends HTML with mechanisms for style sheets, scripting, frames, embedding objects, improved support for right to left and mixed direction text, richer tables, and enhancements to forms, offering improved accessibility for people with disabilities. Later HTML 4.01 has been released, including error corrections and some add-ons to the HTML 4 standard.
Some of the more important changes incorporated in HTML 4 are described below :
The information presented in this section and in the following are rewritten from [34] by the author.
CSS stands for Cascading Style Sheets, and is a relative new standard, in terms of use, that is supposed to complement the HTML standard. Pages on the WWW are written in HTML, but a wish from the founders of the HTML standard (W3C) is to separate the design specific issues from the information. They want that the HTML-tags to describe the information in the document such that the web-browser is able to read the document the way that best suits the reader. Independently of which system the reader uses, the information should be available.
W3C has been working on the CSS standard since 1994 but the first release was published in September 1996, css compatible browsers are now common. In 1998, W3C published CSS2 and they are currently working on CSS3. These are expansions of the earliest standard.
The advantage with CSS is that it enables the writer to define how the HTML-document is supposed to be displayed. This can be done either directly in the HTML-document or in one or more separate files. By adding a framework for this in a separate file or in the top of the HTML-document the HTML code becomes tidier and easier to write and read. The HTML author does not need to write all the format specific tags and attributes in the HTML-code. An additional advantage is that it reduces the size of the HTML pages and makes them faster to load. CSS has some graphical advantages as well, the standard adds functionality for placing elements and specifying the distance between them. CSS is also designed so that it degrades gracefully, if a user has a browser that does not read CSS it should have no effect on the performance of the page, the page is simply displayed without styles.
CSS consists of a list of rules. A rule has two parts: a selector and a declaration list. The selector defines where the rule applies and the declaration list defines what formating that should be done on the element. The selector can be a specific HTML tag or tags identified by class or id attributes (explained later).
The CSS rules can be stored in a separate file or in the HTML document. How this is done will be explained later.
A CSS-rule is format as shown in algorithm 10 and the declarations take the form of property : value as shown by the example in the same algorithm. The properties declared in the example will be applied to all headings of type 1 in the HTML document that this CSS rule is linked to. An important property of CSS is that elements inherit from each other, an element that exists entirely inside another element inherits the styles from the parent element and adds its own rules.
As can be seen in the algorithm above, you can place comments in the CSS-code in the same way as it is done in Java - code.
There are three ways of linking CSS-styles to HTML-documents. The most common way to do this is described in element number 4 in algorithm 10. This TAG links a CSS-file to one or several HTML-documents. It has several parameters and must be placed inside the HEAD tag of the HTML document. Another way to do this is to write the CSS-styles inside a STYLE tag construction at the top of the HTML document, as done in algorithm 11.
The last option for linking CSS-rules to HTML-code is to add a STYLE attribute to the HTML-tag in the code, but this is not the recommended way of doing it, because it eliminates many of the advantages of using CSS.
With CSS there is also the possibility of using the CLASS or ID attribute in HTML to style different instances of the same HTML-tag with different properties as described in part 5 and 6 of algorithm 10. The class and id tags are defined in HTML 4.0 and must be used in slightly different manners, but they result in much of the same functionality.
The box model is a description of how the elements and their suroundings can be described and modified using CSS2. The information and figures presented here are collected from the Cascading Style Sheets, level 2 CSS2 Specification found at the following website http://www.w3.org/TR/1998/REC-CSS2-19980512.
Each box has a content area and optional surrounding padding, border, and margin areas; the size of each area is specified by properties defined in figure 7.8.
As shown in the figure above, the margin, border, and padding can be broken down into left, right, top, and bottom segments. These can be given individual properties in the CSS document, and together they make the css style for the element. The website developed in this thesis uses the box model to place the elements on the site. For example the right handed menu and the logos placed in the menu are all placed using margins, borders and padding properties. For a more thorough explanation of the model the reader should refer the CSS2 specification document.
The PHYREX documentation consists of the following elements :
It is important to perform testing and validation on a software system. This can be done using different models and the choice of model is often based on the model used to design the system (section 5.1). This section will describe the testing and validation done on PHYREX and present the theory behind the models used. Testing is often divided into three main tasks; the testing itself, the validation of the program and the verification of the program. This section draws heavily on Sommerville [28] which covers in detail software engineering, testing, verification and validation.
There are many methods for testing a program. The method chosen is often correlated with the system and / or the model used for system engineering. Since PHYREX is developed in Java and uses the component based modelling method, the test model of choice is the Object oriented testing model.
Object oriented testing involves four different types of tests:
Verification and validation are often interpreted as two words with the same meaning. In the context of testing software system they do not have the same meaning. ``verification'' means to check that the system performs as specified and ``validation'' means to check if the system meets the demands of the users, i.e. if it solves the problems it was designed to solve. Both verification and validation are done by comparing the system to the specification, but with different parts.
PHYREX has been tested in all four categories specified in section 7.4.1. This was done in the following way:
Since PHYREX should be used on a strictly limited set of data and run as a standalone application no explisit performance testing was done on the system. Some thoughts about what results one could expect by such tests are given here. Since the system relies on externally developed components that have to be accessed by executing them separately it is expected that the majority of time used by the system would go to perform these operations. The system performance would therefore depend on the speed of the external components and the framework used for executing these. Several executions of the system on test data sets have shown that 80 - 90 % of computation time is used on these tasks.
The validation of the software was done by showing it to the users and listening to their reponse. This was done by presenting the software at group meetings and by presenting documents to the users. Based on feedback from these presentations PHYREX is found to meet the requirements and has great potential for expansion through incorporation of other methods for analysis and calculation of other input data.
The system verification is done by the developer himself by comparing the software to the specifications developed in the early stages of the development. The system is found to perform well in all functional and non-functional tasks. Other methods to perform the tasks specified, such as sequential reading of files instead of using hashtables, were considered but were found to be a weaker solution to the problem than the existing one.
DNA microarray technologies can monitor the expression levels of thousands of genes on a genomic level. This makes it possible to meassure a large number of expression differences within and between species. Studies like [18] and [36] raises questions about how to explain the differences in gene expression between species. Others again [8] states :
``... They furthermore suggest that such changes have been particularly pronounced during recent evolution of the human brain. The underlying reasons for such expression differences are likely to be manifold, for example, duplications and deletions of genes, promoter changes, changes in levels of transcription factors and changes in cellular composition of tissuees. A challenge for the future is to investigate the relative contribution of these factors to the expression differences observed.''By using the data collected by Enard Et.al. [8] I have developed a framework for studying the relationship between changes in upstream regions of a gene and the change in gene expression of the same gene in a phylogenetic perspective. The results of this analysis is presented with the constraints pointed out in section 6.5.
Although humans and their closest evolutionary relatives are well over 95% identical in their genomes, they differ in many ways. Since the genome has such a high degree of similarity one explanation of the differences can be found in the differences in gene expressions between the species. Differences in the susceptability to AIDS, malaria and Alzheimer's disease is only some of the traits that can be viewed in the context of different gene expression levels instead of structural changes. Studies such as [8,18,36] and others have studied these and other correlations. In this thesis I would like to take the ``problem'' one step further and study the expression levels in comparison with structural differences represented by mutations in the upstream regions of the genes meassured.
The results from [8] shows that a large number of quantitative changes in gene expression can be detected between closely related mammals. The authors furthermore interpret their results to have a different rate of change in different tissue samples. E.g. the brain samples shows a higher rate of change than the liver samples indicating that the changes can be selective and has in the later stages of evolution pronounced the human brain. As stated in the citation above these changes can come from many different sources such as e.g. promoter changes.
Since there, as pointed out above, exists a clear relationship between the gene expression level of a gene and it's biological function one can by the introduction of microarray experiments monitor the expression levels of thousands of genes at the same time to find such relations. This can be the case for genes that responds to different environmental stimulies, but when studying the evolutionary change in gene expression one has to determine the ammount of changes that is due to darwinian selection or just a product of stochastic processes. There are several studies that tries to solve this problem, e.g. [18], which concludes that the wast majority of gene expression differences within or between species are not functional adaption but nearly neutral.
With the ongoing sequencing projects on both human and some of the great apes the ``technological'' gap of comparing the genomes is narrowing. This enables new experiments and analysis to take form. In this thesis a genome wide comparison of sequences from upstream regions and gene expression profiles from the same genes is performed. The motivation for doing such a comparison can be found in [8] among others. By correlating a mutation rate with a log ratio of changes in gene expression one can use both sources to identify sites where a possible change in the biological function between two closely related species occur. Theese candidate sites could be filtered out and furthered analysed. The task of this thesis is to develop a high troughput framework for identifying these candidates.
When developing this framework the hypothesis of to what degree the evolutionary changes should be explained by darwinian selection or just stochastic processes has been a major issue to handle. The whole idea behind the thesis relies on some kind of Darwinian selection to take place, or at least not a neutral theory.
As mentioned earlier the testset was originally designed in [8] to study the relationships between higher primates to see if the evolvement of genes had a differen growth factor in different tissues in the organism. The work of Enard et.al present some results that gave rise to the questions I've tried to solve during the work with this thesis. The differences in the factors the gene expression change in different tissues points in the direction of a Darwinian selection theory.
The numbers and figures presented to you in this chapter are all collected from a thorough analysis of the dataset generated by PHYREX using a phylogenetic tree of [[chimpanze,human],orangutang] with the gene expression profiles denoted as sample 1 from the brain of chimpanze 1, human 1 and orangutang 1 in the dataset distributed in [8]. Using this dataset I have reconstructed ancestral gene expression values at the internal nodes of the tree. The promoter sequences for these genes were collected from the EnsEmbl database (100 bp upstream of gene start site). One important issue to take notice of is that the sequences of orangutang is swapped with the sequences of mus musculus since there exists no complete sequenced genome of the orangutang yet. The sequence data was aligned by use of ClustalW and the last common ancestor sequences was calculated using BASEML from the PAML package.
By swapping the orangutang and mus musculus sequences the analysis of the edge between the ancestral node of human and chimp and the root node, and the edge between the root and orangutang is not possible to perform. The analysis of the subtree containing human and chimp can be performed. The numbers and figures presented here are collected from this analysis.
The Statistical Algorithms Description Document described algorithms for normalizing samples against each other on a probe pair level. In PHYREX, however the normalization is done on gene expression profile level.
The data found on the AffyMetrix GeneChips can be described as extreme value distributions, See Figure 8.1.
To normalize the extreme value distributions described in Figure 8.1, the middle value of the two distributions is calculated and subtracted from each other. This is done to make the values in each distribution more comparable to each other.
When giving input data to PHYREX the user selects on eof the leafs to guide the normalization. The middle value of the gene expression distribution of that species is then calculated. Then the middle values of the gene expression distributions of all the leafs are calculated and subtracted from the one of the selected node, the result is a constant value that can be added to the leaf distributions to move them closer to the selected one. This is done one time for each pair (selected node - leaf). The normalization is done on the data that are read out of the Affymetrix GeneChips before reconstructing ancestral gene expression states. In PHYREX this is done as described in Algorithm 12.
Much of the work presented here are done by methods in the GeneFilter module in PHYREX. The GeneFilter module is of now not a general module and depends on the species/branches in/of the tree under study. The analysis decribed here is performed on one branch of the tree, meaning that it has to be performed multiple times to cover all branches.
To guide the branch analysis a distribution of the sequences based on the change in gene expression along the branch is constructed. As figure 8.2 points out one can based on this distribution pick out two groups of sequences, middle candidates and edge candidates. The middle candidates are the ones with a small change in gene expression and the end candidates are the ones with high gene expression change, either negative or positive. The end candidates is collected as one group, but annotated to see if they are upregulated or downregulated.
The groups are then filtered with a mutation treshold. This is done to filter away candidates that has many mutations in the alignments, since these most likely cannot be trusted. The homolog sequences from the ancestral node and the sibling is collected, aligned and the position of the mutations is calculated.
The distribution of gene expression changes for the two branches of study is seen in Figures 8.3 and 8.4. Along both the chimpanzee and human branch from their common ancestor I have isolated 67 (1%) candidates from each of the ends and 134 (2%) candidates from the middle of the distribution, see Figure 8.2.
To guide the selection of candidates I calculated a ``higher than'' treshold for the upregulated candidates and a ``lower than'' threshold for the down regulated candidates, and a interval to choose the middle values between. Described as the red and blue vertical lines in figure 8.2. For the analysis of the testset from Enard Et.al. these limits are as described in Figure 8.1. All numbers are log ratios.
The mutation treshold was set to 5 along both lineages, meaning that sequences that had diverged more than 5 % from the original sequence, collected form Ensembl was not selected due to the unatural large amount of change along the branch. The mutation treshold does not consider all types of gaps.
Gaps can not be viewed as a mutation since there exists a lot of gaps in the sequences to align these properly and most of them exists as end-gaps where there is very little information about if the sequence could be better aligned if a larger amount of basepairs were present.
There exists different types of gaps in the alignments used in PHYREX, some that are counted as a mutation and others that are not. Gaps are processed after the following rules :
I found the promoter sequence substitution rate to be similar in both the test and control sets and an almost similar percentage of sequences passed the 5% divergence treshold. For the remaining sequences, the analysis concists of using TESS-Transcription Element Search Software [31] to search the TRANSFAC database. TESS takes a sequence as input and search TRANSFAC for transcription factors that can be locally aligned with regions of the input sequence. The output from TESS concists of a list of transcription factors that match the input sequence and the position and length of the transcription factor. These lists are manually controlled for each input sequence and correlated with the information on mutations calculated by PHYREX.
If TESS aligns a transcription factor binding site such that there exists a mutation, along one of the lineages under study, inside the transcription factor it gets annotated. When all mutations are annotated, distributions are made for both the middle control group and the end candidates and a average mutation rate is calculated and presented with standard error, see Table 8.2.
My results points out that the chimpanzee lineage shows significantly more substitutions in transcription factor binding sites for the genes that are most up- or downregulated. Such observations are also done along the human lineage.
xxxxxxxxxxxxxxxxxxxxxxxxxx(MORE HERE)xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Along the human lineage, Tables 8.3 and 8.4 shows the genes that are most upregulated and most downregulated in the Enard Et.al. dataset together with the transcription factor binding sites that have undergone substitutions along the lineage.
Tables 8.5 and 8.6 shows the same results along the chimpanzee lineage.
These binding sites are candidates for playing an important role in the lineage-specific divergence of human and chimpanzee and should be investigated further. xxxxxxxxxxxxxxx(MORE HERE)xxxxxxxxxxxxxxxxxxxxxxxx
Following speciation, there are many possible molecular events that can drive the divergence of species. One of the most important mechanisms is changes in the regulatory regions that affect gene expression. Another similar mechanism is changes in the regulatory regions that affect mRNA splicing. No systematic approach to examine these mechanisms has been performed, because appropriate methods and datasets are lacking.
With the progress of different genome sequencing projects several preliminary datasets now exists. For example yeast XX and primates [8]. Even if they are preliminary they could serve as a starting point to enable testing of methods.
My approach consists of constructing ancestral gene expression profiles and correlating this with the analysis of actual regulatory sequences along the lineages that show a significant change. Analysis (Section 8.2.3) and results (Section 8.2.4) points out that the method described here are well suited to pick out candidates where such effects may have occured.
The goal of this project was to construct a framework for generating ancestral gene expression profiles, by use of a minimum evolution approach, and examine lineages that show a significant change by correlating them with an analysis of changes in transcription factor binding sites.
Much work has already been done in the fields of Gene Expression analysis and sequence reconstruction, but very litle is done combining the two. The framework developed can be divided in three tasks :
The framework also enables the user to isolate candidates for further analysis by external databases and search systems like e.g. TESS.
The results and methods are published here and submitted to BMC BioInformatics early 2005.
Some remarks about further work. As mentioned in the introduction of this thesis it would be preferable to make the program work for both gene expression profiles and the use of alternative splice sites. Recomended further works would therefore include making the program more general to incorporate the use and calculations of alternative splice site usage. Another more program specific task to work on in the future would be a more tight incorporation with the externaly developed programs and methods. More methods of analysis and / or normalisation could also be included in the program at a later time. As of now the program use input data from Ensembl and AffyMetrix, it would also be preferable to make it able to handle other input formats like for example cDNA arrays, UniGene entries etc.
Please note that I cannot guarantee that the www universal resource locators given in this thesis will remain valid. I have tested these adresses, but because of the rapid modification of web resources, it is highly possible that they will change in the future.
Also some of the works refered in this thesis wasn't published at the time of printing. I cannot guarantee that these will be published, meaning that some of the references may be invalid. To deal with this problem I will try to maintain a printed or electronic copy of these documents, for documentation reasons.
© Copyright 2003 - 2004 by Roald Rossnes. The software package is provided "as is" without warranty of any kind. In no event shall the author or his ``employer'' be held responsible for any damage resulting from the use of this software, including but not limited to the frustration that you may experience in using the package. The program package, including source codes, example data sets, executables, and this documentation, is distributed free of charge for academic use only. Permission is granted to copy and use programs in the package provided no fee is charged for it and provided that this copyright notice is not removed.
``As genomes evolve under selective pressures, not only do protein coding sequences evolve, but the gene regulatory sites effecting expression profiles and alternative splice site usage evolve. To understand the evolutionary history of gene and genome functionality and the selective pressure that have effected the evolution of genomes, it is desirable to reconstruct the ancestral state of gene expression and splice site usage. At a first level, Parsimony and maximum likelihood approaches to these continuous traits will be developed. These will be correlated with an analysis of the actual regulatory sequence reconstruction done with equivalent methods (parsimony and likelihood), where such information is available. Ultimately these approaches can be applied to large scale datasets.''This specification was written in cooperation with David Liberles at BCCS - CBU at the University of Bergen and served as a foundation in the process of describing the system to be developed. It developed into the description stated below:
A tool for constructing ancestral values by use of a Minimum Evolution approach in a given phylogenetic tree is to be constructed. The input is data from microarray experiment and database retreival of correlating sequences. The program should have the following functionality :
The development of the program forced me to make some changes in the specifications during the implementation stages. This was changes like e.g. changing the externally developed module for sequence reconstruction or changes in the procedures of parsing the different file formats. None of these changes led to the program not solving any of the specified goals.
© Copyright 2003 - 2004 by Roald Rossnes. The software package is provided "as is" without warranty of any kind. In no event shall the author or his ``employer'' be held responsible for any damage resulting from the use of this software, including but not limited to the frustration that you may experience in using the package. The program package, including source codes, example data sets, executables, and this documentation, is distributed free of charge for academic use only. Permission is granted to copy and use programs in the package provided no fee is charged for it and provided that this copyright notice is not removed.
As genomes evolve under selective pressure, not only do protein coding sequences evolve, but the gene regulatory sites affecting expression profiles and alternative splice site usage also evolve. To understand the evolutionary history of gene and genome functionality and the selective pressures that have affected the evolution of genomes, it is desirable to reconstruct the ancestral state of gene expression and splice site usage. At a first level, I have developed a Minimum Evolution approach based on the use of gene expression profiles. The approach is implemented to work with large scale datasets like microarrays, e.g. the Affymetrix genechip technology, and is correlated with an analysis of the actual regulatory sequence reconstruction done by similar methods, where such information is available.
My objectives with this research is to highlight the changes made by selective pressure by use of Minimum evolution methods to reconstruct continous data at the ancestral states in a phylogenetic tree.
To reconstruct the ancestral states of the continuous data traits I have developed a brute force algorithm that constructs an interval of allowed values on each internal node in a phylogenetic tree (Schreiber format), and chooses the best value to represent each node. The algorithm runs trough the tree two times, hence an order O(2n) time complexity. In the first run the intervals of allowed values are constructed and in the second run the intervals are narrowed and the representing value is chosen. This is done for every gene represented in the large scale data set.
An organism can by accumulating substitutions divide into two closely related organisms. By comparing sequences from a set of homologous genes from different species it is possible to find the sequence of their closest common ancestor. Algorithms that do such calculation on sequences have allready been developed. In this thesis ClustalW and BaseML are used. ClustalW is used to align sequences found in the leaf nodes of the tree and BaseML is used to calculate sequences at the ancestral nodes of the phylogenetic tree based on the alignment done by ClustalW. The sequences used are the upstream regions of the genes collected from EnsEmbl. Our work has also produced an algorithm and methods for constructing ancestral gene expression profiles. and a framework that can be used to display and compare them to the sequence calculation done as described above.
Simultaneous reconstruction of regulatory sequences and expression profiles reduces the signal to noice ratio by using long branched significant classes to correlate substitutions with functional effects. By comparing the two calculations mentioned above and by isolating candidate genes where a clear change in gene expression is correlated with a high number of point mutations in the upstream region, the signal to noice ratio is reduced.
PHYREX is distributed as a zipped catalog structure. For the program to run properly it is dependant of an internal catalog structure in the software package. This means that the catalog structure that is constructed in the program download file must be maintained when unzipped onto your local system. This is important because the program use a static catalog structure below the root directory (EvolutionFolder).
PHYREX is made to work under a Linux environment with an installation of a Java Virtual Machine. The first thing to do is to check if these properties is fulfilled.
Since the program is running on a Java Virtual Machine you could think that it should be platform independent. This is not correct since the program does perform some platform dependent calls to the operative system and because of the CLisp module used by the program is implemented in a Linux binary file.
PHYREX is developed and runned under both whitebox and redhat linux.
Enclosed with PHYREX there is a treeconverter that convert trees from Schreiber to Nevick format and the other way around. The treeconverter is developed in CLisp and compiled in to a Linux binary file that is run by a script file. This makes the treeconverter platform dependent but it does not need a CLisp compiler. If you want some documentation on CLisp you can find some information on http://www.cons.org/cmucl.
PHYREX is using externally developed software like PAML and BaseML. To make these programs available to the program the path to the executable files must be set in the path. I'm currently running a whitebox linux version which is using a bash shell. Under this configuration I use the .bashrc file in my root catalog to set the PATH and other variables. The file must at least contain these declarations :
export JDK_HOME=/Home/stud/roaldr/java/j2sdk1.4.2_06/lib/
export JAVA_HOME=/Home/stud/roaldr/java/j2sdk1.4.2_06/
exportPATH=./:/Home/stud/roaldr/hfoppg/evolutionFolder/clustalw1.83/:
/Home/stud/roaldr/hfoppg/evolutionFolder/bnb/:/Home/stud/roaldr:
/Home/stud/roaldr/java/j2sdk1.4.2_06/bin/
export CLASSPATH=./:/Home/stud/roaldr/database/hsqldb/lib/hsqldb.jar:/Home/stud/
export CVSROOT=/net/bccs/cbu/cvsroot xset b off
This is just example data and the users has to set up their own catalog structure and path files.
PHYREX is developed using Java 1.4.2 and newer API. This does not mean that it only works using this API, but I cannot guarantee that all functions will compile using older APIs. I therefore recomend an installation of Java 1.4.2 or newer berfore using PHYREX. Java API's can be obtained from http://www.sun.com.
When this is done and all the above mentioned resources and properties is collected PHYREX should be compiled this can be done by entering the root folder and runing:
javac rrevolution/*.java
java -Xmx250m rrevolution/Rrevolution
Since PHYREX is using data collected from many different sources it depends on a variety of different file formats. The file formats are often described by the organization constructing them or based on other formats that are well documented.
Schreiber vs. Newick
Trees as a datastructure is normally represented as a string of characters encapsulated by various kinds of brackets constructing the branches between the nodes, normally represented by a node number e.t.c. Different formats have different ways of formating the string.
The Schreiber tree format is very flexible and can be expanded by just adding tags, and can therefore be used to store information in a tree structure, a kind of a low level XML code. It makes use of square brackets and commas (``[``,'']'','','') to build the tree structure. Meaning that each external node is represented by a number and each internal node is represented by a pair of paranteses that enclose all the decendents of that internal node.
For instance in figure B.1 the internal node [1] would be represented like [a,b] in the string representing the tree.
The tree in the figure above is represented by this string ``n:-[[a,b],[c,d],[e,f],[[1],[2]],[[3],[4]]]''. The string is read from left to right constructing parent nodes for each pair of parantheses that are closed and iteratively giving the parent nodes numbers. When all the leaf nodes are combined, parent nodes are combined in the same fashion as the leaf nodes by pairing their number (enclosed in brackets) inside another set of square parentheses. In general, one could describe the format by giving each external node a number related to the position of the node, and each internal node could be viewed like a list of children denoted [X], where X is a number and [X] is a list of numbers separated by commas (``,''). Only the childnodes of a internal node must be combined before the internal node could be constructed. Meaning that ``n:-[[a,b],[c,d],[[1],[2]],[e,f],[[3],[4]]]'' would also be a legal statement, since both internal node 1 and 2 are constructed one can construct internal node 3 berfore combining external nodes e and f. This would result in the tree shown in figure B.2.
The newick format is a more common format for representing trees. This format uses regular parantheses and a nesting of these to represent the tree stuctures. A representation of the tree in figure B.1 in newick format would be ``((((a,b),(c,d)),(e,f)));'' All newick formated trees must end with a semicolon (``;''). There exists an extension of the newick format annotated NHX. This extension is oftens used to represent annotated trees because of the tags introduced in the format. This is a more flexible solution than the original format, but not as flexible as the schreiber format. A more complete description of the newick format is available from Joseph Felsenstein's website.
As mentioned earlier PHYREX have a treeparser that can convert trees between the two different tree formats, but since the treeparser is implemented in CLisp and system dependent one could get around it by maintaining a copy of the tree in both Schreiber and Nevick format.This needs knowledge of the Schreiber format. The Nevick Format is the same format that is used in the Phylip package and documentation can be found with a simple google search on nevick and/or phylip. The Schreiber format is not so much used and can be harder to find information about. The suggested way to deal with this is to use the treeconverter to make the schreiber tree based on a nevick tree, this can be done as following :
treeconverter <nevick_tree> <output_file> <.>
if you are in the converter catalog in the catalog structure.
Sequences used in the program are collected from the EnsEMBL database and written in a FASTA - like format. One needs to collect one file per species and one file containing a list of homologs between all the species from EnsEMBL. A procedure on how to collect these files can be found in Section B.3.2.2.
Ensembl is freely accessible and consists of several levels from which data can be collected. Ensembl can be accessed via an html interface, a java API, a batch query tool (EnsMart) and trough different tools that can be locally installed on your computer. The data can be retrived as flat files, FASTA formated files or raw dumps. All these different ways of accessing Ensembl make it extremely flexible and easy to work with. In this project the user can access Ensembl through EnsMart.
The fasta format is a specific way to format database retrievals. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.
EnsMart is an EnsEmbl tool that allows the user to run batch queries against the EnsEmbl database. It is structured to allow genome queries and therefore a very useful tool in this project. This section describes how the user should use EnsMart to retrieve the data from the EnsEmbl databases in a manner that suits this program. There are two different data files that need to be downloaded from the EnsEmbl database.
First, the user has to collect a file containing all the sequences of the upstream regions of the genes in study. This can be done by following the recipe described in Algorithm 13.
After collecting sequence files for all the species the user must collect a file containing ``maps of homologs''. This is a mapping of EnsEmbl id's of the genes from each species that are homolog with each others. This can be done by following the recipe described in Algorithm 14.
A more thorough explanation of Ensmart and EnsEmbl in general can be found on their website http://www.ensembl.org.
The system input files have different formats, some of them are species dependent and exists in many copies, others follow the analysis and exists in only one copy. The main provider of these files are the Affymetrix experiment and the EnsEmbl database.
From the AffyMetrix experiment, the following files must be collected:
The <name>.cdf file holds information about the structure of the genechip used in the experiment. It maps the perfect match and mismatch probes against each other to construct a probe pair and specifies the ``algorithms'' used to construct the probe sets. The .cdf file is formated in a special tab separated format developed by AffyMetrix and needs a specially developed parser to be read properly. For each experiment to be analysed in PHYREX one .cdf file is needed. The file uses a name convention that specifies the GeneChip it is developed for. In the test set used in this thesis the file is called HG_U95A.cdf
The .cel - files holds the values for each probe on the GeneChip. The values are represented with a number, positive or negative, describing the amount of target DNA that did bind to the probe. To run a PHYREX analysis you need one .cel file per species in the analysis (one per leaf node in the tree). The file is tab separated following a similar but not identical format as used in the .cdf file and needs a specially developed parser.
The annotation file either follows the AffyMetrix experiment or can be downloaded from the AffyMetrix wesite (http://www.affymetrix.com). It contains a list of all the id's used on a GeneChip and maps them to other id's in public available databases such as EnsEmbl. The file is formated as comma separated values and follows a similar naming convention as the .cdf file (testset : HG_U95A_annot.csv) This file can be obtained in other formats as well, but support for these formats are not developed in PHYREX.
The graphical user interface of PHYREX is divided into four parts: the command panel, the tree panel, the node panel and the progress panel. I'll give a quick introduction to all three panels here and a description of what they contains.
The description below is done separate for each button or component in the GUI, some components must be initialised before others to make the program work as intended. By following the consecutive order the components are described in below these interactions will work.
The command panel is the menu system of PHYREX. In this panel almost all the user interactions and tuning of the system is done. The command panel is divided into four parts each taking care of different types of user interaction.
The input part of the command panel takes care of the input needed to set up the analysis and initialize the tree and node panels.
The Affy button is used to load the AffyMetrix annotation file into the system. It starts processes for reading and parsing the file into a local ``database'' that is used later to populate the leaf nodes with data.
When clicking the CDF button the user is asked to specify the path to the CDF file of this analysis. The CDF file contains information used by PHYREX to set up the system for reading in gene expression data from the microarray experiment data files.
The Map button is used to specify the file containing the map of homologs used by the system. This file contains information on wich genes in the leaf nodes that are homolog to each other. The file can be obtained as described in Section B.3.2.3.
To guide the reconstruction process done by PHYREX the system needs a phylogenetic tree. By clicking on the Tree button the user specifies a path to the tree file. The tree file must be in Schreiber format, and is parsed immediately after loaded. When the tree is loaded the Tree panel is initialized and made active.
This part of the command panel is used to set different propeties used by the system when reconstructing sequences or analysing the results.
The Paml button opens a panel that can be used to tune the BaseML parameters used in the sequence reconstruction. The tuning of these parameters assume knowledge of the BaseML system, the user is refered to the PAML user manual.
The Analysis button is used to give parameters to the analysis that should be performed after the reconstruction and population of the internal nodes of the tree is done. In the same way as the Paml button it opens a panel containing all the porperties that can be altered.
This part of the command panel is used to perform the population of the tree and to analyse the results. Before running these buttons all input data must be collected both in the command, tree and node panel.
After all the data is collected the Calculate button performs the population and reconstruction algorithms for both gene expression values and sequence data.
The analysis button is used to isolate candidates for further investigation by e.g. TESS. It uses the parameters given by the analysis properties given in the settings section of the command panel.
The Menu consists of Save, Load, Properties and Exit buttons.
The Load button can be used to retrieve previously calculated results to perform new analysis on them.
The Save button saves the calculations in an XML - like format.
Not active in this version of PHYREX.
Exits the system.
The Tree Panel becomes active after the Tree button in the Command Panel has loaded a tree in Schreiber format. The tree displayed is used to populate the leaf nodes in the tree with data. Before running any calculations each leaf node must be visited and data must be loaded in the Node panel for each leaf.
In addition the user must specify one internal node to analyse, See Node Panel Section.
The Node panel takes care of the display and collection of leaf specific data. For each analysis some files must be collected for all the leafs in the tree, this is done by the Node panel.
The Cel File button is used to specify the path to the file containing micro array experiment values for the leaf. This is stored in the .cel files collected from Affymetrix.
The Seq File button specifies the upstream region sequences found from EnsEmbl for each species.
The button is used to specify the species used to guide the normalisation of the gene expression data. Normally this is the species the AffyMetrix GeneChip is designed for. Only one leaf should have the value ``Yes'' in each calculation.
The Analysis button is used to specify the tripplett that should be analysed. Only internal nodes may have the value ``Yes''. The analysis is performed on the node specified and its two childs.
The Gene Table button displays the calculation results of this node. It gets active after the calculation is performed.
The Edge Table button displays the changes along the branch between this node and its parent.
The Progress panel is used to display the progress in the calculations after the user has clicked on the calculation button in the Command panel. The calculations take some time and the Progress panel displays an approximation of how much work that has been done.
The purpose of this User Maual is to give the user a brief introduction to how PHYREX work and how to install it. For documentation on how to install other software that PHYREX needs to work I refer the reader to that particulare software user manual.
A more thorough documentation of PHYREX is available in the thesis document found on http://www.rossnes.org/phyrex.
This document was generated using the LaTeX2HTML translator Version 2002 (1.62)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -no_subdir -split 0 -show_section_numbers /tmp/lyx_tmpdir762zgjWIF/lyx_tmpbuf0/thesis.tex
The translation was initiated by Roald Rossnes on 2005-02-23