Homology Based Sequence Alignment and Annotation Algorithms
General Material Designation
[Thesis]
First Statement of Responsibility
Amin, Mohammad Ruhul
Subsequent Statement of Responsibility
Skiena, Steven
PUBLICATION, DISTRIBUTION, ETC.
Name of Publisher, Distributor, etc.
State University of New York at Stony Brook
Date of Publication, Distribution, etc.
2019
GENERAL NOTES
Text of Note
97 p.
DISSERTATION (THESIS) NOTE
Dissertation or thesis details and type of degree
Ph.D.
Body granting the degree
State University of New York at Stony Brook
Text preceding or following the note
2019
SUMMARY OR ABSTRACT
Text of Note
Research in bioinformatics is driven by the need to analyze and interpret biological sequences. Analysis of biological sequences begins with alignment, while their interpretation in terms of biological function begins with annotation. With the rapid development of high-throughput genome sequencing techniques, alignment and annotation methods are also evolving. In this thesis, we discuss the shortcomings of current alignment and annotation methods and present novel algorithms with improved results. The Oxford Nanopore single-molecule sequencing technique generates long reads with higher sequencing error rates. Popular alignment algorithms such as LAST and BLAST take considerable processing time to align long reads at high sensitivity; BWA-MEM has the smallest average alignment length; and GraphMap aligns many random strings with moderate accuracy. We introduce a novel open-source read alignment tool, called NanoBLASTer, that includes several enhancements to maintain high sensitivity and high performance in the presence of high error rates. The advent of large-scale genome sequencing has proven to be a tremendous boost to research in the life sciences. However, published genomes have been shown to be very uneven in both sequence and annotation quality, with quality declining dramatically in both respects as we enter the long tail of non-model organisms. We present methods to identify massive numbers of prokaryotic sequence annotation errors in public databases and demonstrate that homology and pattern matching techniques can be deployed to correct them. In summary, we have re-annotated 12,495 16S rRNA 3' ends, increasing the total number of prokaryotes with 16S rRNAs containing antiSD sequences from 8,153 to 20,648, and increasing the number of organisms known to lack an antiSD from 15 to 128. Finally, we present DeepAnnotator, a deep learning method to solve the problem of genome annotation on a large scale.
DeepAnnotator uses a recurrent neural network with long short-term memory (LSTM) to predict the start, stop, and coding sequences of a gene, and combines those scores in a downstream algorithm to annotate genome sequences. DeepAnnotator establishes a generalized computational approach for genome annotation using deep learning and achieves an F-score of 94%.
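For readers unfamiliar with the F-score reported above, the following is a minimal sketch of how such a score is commonly computed from prediction counts. It assumes the balanced F1 form (beta = 1); the abstract does not state which variant the thesis uses. The function name and the counts are hypothetical, chosen only for illustration, and are not taken from the thesis.

```python
def f_score(tp, fp, fn, beta=1.0):
    """Return the F-beta score from true positive, false positive,
    and false negative counts. beta=1.0 gives the balanced F1 score
    (an assumption; the abstract does not name the variant used)."""
    if tp == 0:
        # Precision and recall are both zero or undefined; report 0.0.
        return 0.0
    precision = tp / (tp + fp)  # fraction of predictions that are correct
    recall = tp / (tp + fn)     # fraction of true items that are recovered
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for illustration only (not results from the thesis).
example = f_score(tp=90, fp=10, fn=30)
print(f"F1 = {example:.3f}")
```

With beta = 1 the score is the harmonic mean of precision and recall, so it is high only when both are high.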