A pan-genomic investigation of the contributions of LTR retrotransposons to gene evolution and genome structure
DeBarry, Jeremy Daniel
Higher eukaryotic genomes are typically large, complex and filled with both genes and multiple classes of repetitive DNA. The repetitive DNAs, primarily transposable elements, are a rapidly evolving genome component that can provide the raw material for novel selected functions and can also indicate the mechanisms and history of genome evolution in any ancestral lineage. Here it is shown that approximately 1.5% of mouse (Mus musculus) genes contain LTR retrotransposon (LRP) sequences. Consistent with earlier findings in C. elegans, D. melanogaster and H. sapiens, LRPs are more likely to be associated with newly evolved genes. Evidence is presented that LRPs are often recruited as novel exons or as spliced additions to existing exons. These novel gene configurations may be expressed initially as alternative transcripts, providing an opportunity for the evolution of new gene function(s). Despite their abundance, universality and significance, genomic repeats have received limited analysis except in fully sequenced genomes. To facilitate a broader range of repeat analyses, the Assisted Automated Assembler of Repeat Families (AAARF) algorithm was developed. AAARF identifies sequence overlaps in small shotgun sequence datasets and walks them out to create long pseudomolecules representing the most abundant repeats in any genome. Testing of this program in maize indicated that it found and assembled all of the major repeats into one or more pseudomolecules, including coverage of the major LTR retrotransposon families. Both Sanger and 454 sequence datasets proved suitable as input. Application of the AAARF algorithm allowed for the classification of high copy number repeats in four agriculturally important grass species: Saccharum officinarum, Sorghum propinquum, Panicum virgatum and Pennisetum glaucum. Previously, such an analysis was not feasible because assembled genome data were lacking for each of these species.
Classified repeats were found to include known transposable element families, genomic structural repeats (ribosomal, centromeric and satellite), and novel repeat families for each species. A phylogenetic analysis of the coding regions of the LRPs identified in each family indicates that these elements are often more closely related to elements in other species than to other elements within the same genome. This suggests that most or all of these LRPs are transmitted vertically from generation to generation.
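
The overlap-and-walk idea behind AAARF can be illustrated with a minimal sketch. This is not the AAARF implementation: it substitutes exact k-mer matching and a majority vote for whatever overlap detection the program actually uses, it extends in one direction only, and the function name `walk_right` and the parameters `k`, `min_support` and `max_len` are hypothetical. The toy sequence and reads are invented for illustration.

```python
from collections import Counter


def walk_right(seed, reads, k=8, min_support=2, max_len=200):
    """Extend `seed` to the right one base at a time (illustrative only).

    At each step the last k bases of the growing contig are used as a probe.
    Every read containing the probe votes for the base that follows it, and
    the most common base is appended if at least `min_support` reads agree.
    """
    contig = seed
    while len(contig) < max_len:  # max_len guards against endless tandem repeats
        probe = contig[-k:]
        votes = Counter()
        for read in reads:
            start = read.find(probe)
            while start != -1:
                nxt = start + k
                if nxt < len(read):
                    votes[read[nxt]] += 1
                start = read.find(probe, start + 1)
        if not votes:
            break  # no read extends past the current end
        base, count = votes.most_common(1)[0]
        if count < min_support:
            break  # too little evidence that this is a high-copy repeat
        contig += base
    return contig


# Toy data: a made-up 24 bp "repeat" sampled as overlapping 12 bp reads,
# each present twice to mimic a high copy number element.
repeat = "ATGCGTACCTGAAGTCCATGGCTA"
windows = [repeat[i:i + 12] for i in range(0, 13, 3)]
reads = windows * 2

contig = walk_right(repeat[:8], reads)
print(contig)
```

The `min_support` threshold is what restricts the walk to abundant sequences in this sketch: with low shotgun coverage, single-copy regions rarely gather enough agreeing reads, so extension stops.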