Computational methods for deciphering genomic structures in prokaryotes
High-throughput sequencing technologies have generated huge amounts of genomic data. This wealth of genomic data provides computational biologists with unprecedented opportunities to unveil the biological machinery encoded in genomes. Characterizing the structure of genomes is an important and challenging task; it is an essential step towards deciphering the networks and pathways in a biological system. The characterization of microbial genomic structures includes: (1) identifying neighboring genes that are co-transcribed (also known as operons); (2) identifying groups of operons with evolutionary relationships (also known as uber-operons); and (3) elucidating higher-level structures that share common regulatory controls, including protein-DNA binding events and cis-regulatory elements among operons (also known as regulons). The primary goal of this thesis is to develop computational methods for elucidating these three categories of genomic structures in prokaryotes. UNIPOP, a maximum bipartite matching-based algorithm, is designed and implemented to predict operon structures of any prokaryotic genome, without relying on experimental data or training data. The prediction accuracy of UNIPOP is shown to be superior to that of most other operon predictors when evaluated on two well-studied organisms. The evolutionary relationships among operons are elucidated by using comparative genomic data and a maximum matching-based algorithm. A comparative study of uber-operons and regulons has shown that they are highly related, indicating the effectiveness of using uber-operons for predicting regulons. With the availability of predicted operons, we propose an approach, phylogenetic footprinting for prokaryotes, to study cis-regulatory motifs in the promoter regions of operons. By integrating the motif data with uber-operon data, and formulating the task as a graph partitioning problem, we predicted regulons in Escherichia coli K12.
Different sources of validation have shown that our predicted regulons are consistent with known regulons, functional relatedness, and expression data. More importantly, we have also derived some novel regulons that are biologically meaningful. In summary, we predict different levels of genomic structures by developing novel graph-theoretic algorithms and using comparative genomic analysis. Our methods are universally applicable to all sequenced microbial genomes, and outperform most of the other published methods in terms of prediction accuracy. Our prediction tools can provide assistance in understanding the machinery of gene regulation, biological networks and pathways.
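
The abstract names maximum bipartite matching as the core technique but does not describe how UNIPOP constructs its graph. As a minimal illustration of the general technique only, and not of UNIPOP itself, the sketch below computes a maximum matching with the standard augmenting-path method (Kuhn's algorithm). The gene identifiers, the `candidates` map, and the function name are all hypothetical; the sketch assumes each gene in one genome has a list of candidate counterparts in a second genome.

```python
from typing import Dict, List, Set


def maximum_bipartite_matching(adj: Dict[str, List[str]]) -> Dict[str, str]:
    """Return a maximum matching as {left_node: right_node}.

    adj maps each left node to the right nodes it may be paired with.
    Uses augmenting paths (Kuhn's algorithm), O(V * E) in the worst case.
    """
    match_right: Dict[str, str] = {}  # right node -> left node currently paired

    def try_augment(u: str, visited: Set[str]) -> bool:
        # Try to pair u, re-routing earlier pairings if necessary.
        for v in adj.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be moved elsewhere.
            if v not in match_right or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    for u in adj:
        try_augment(u, set())

    # Invert to left -> right for readability.
    return {u: v for v, u in match_right.items()}


# Hypothetical candidate pairs between genes of two genomes (illustrative only).
candidates = {
    "a1": ["b1", "b2"],
    "a2": ["b1"],
    "a3": ["b2", "b3"],
}
matching = maximum_bipartite_matching(candidates)
print(matching)
```

In this toy input, a greedy pairing of a1 with b1 would leave a2 unmatched; the augmenting step re-routes a1 to b2 so that all three genes are paired. For larger graphs, a library implementation such as Hopcroft-Karp would normally be preferable to this recursive sketch.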