Counter based suffix tree for DNA pattern repeats

Tshepo Kitso Gobonamang, Dimane Mpoeleng

Research output: Contribution to journal (article)

Abstract

In recent years, string datasets have grown exponentially, and so has the need to process them. Many of these datasets come from bioinformatics, since the characteristics of any living organism are encoded in its genes. Genes consist of nucleic bases, which in turn make up the entire genome: a genome is a concatenation of different types of nucleic bases. Extracting the information encoded in these sequences efficiently requires algorithms that can decode it. Most available methods use the data structure commonly referred to as the suffix tree. Suffix trees have evolved tremendously over the years, and on-line construction of the suffix tree is deemed the best approach; however, it is not optimal for finding repeated sequences because of the many traversals the algorithm must perform when identifying repeats. To improve the speed of finding repeats, we developed a counter based suffix tree algorithm. Our work presents a novel algorithm for constructing a counter based suffix tree without losing the properties of the suffix tree. The time complexity of the counter based suffix tree is Θ(n), where n is the length of the string, which is the same as that of the fastest suffix tree implementation. We have shown that the counter based suffix tree reduces the search time when identifying repeats. We have proved that a counter based suffix tree can be built during construction.
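The general idea of attaching counters to a suffix structure can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a naive, uncompressed suffix trie built in O(n^2) time, whereas the abstract describes a linear-time suffix tree. The names Node, build_counted_suffix_trie and count_occurrences are hypothetical and chosen only for illustration. Each node stores how many suffixes pass through it, so the number of occurrences of a non-empty pattern is read from one counter instead of being found by traversing the subtree below the match.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One trie node; `count` is the number of suffixes passing through it."""
    count: int = 0
    children: dict = field(default_factory=dict)


def build_counted_suffix_trie(text: str) -> Node:
    """Insert every suffix of `text`, incrementing a counter on each node visited.

    Assumption: naive O(n^2) construction, used only to show the counters.
    It is not the linear-time construction described in the abstract.
    """
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root


def count_occurrences(root: Node, pattern: str) -> int:
    """Return how often a non-empty `pattern` occurs in the indexed text."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0  # pattern does not occur at all
    # The counter was filled in during construction; no subtree walk is needed.
    return node.count


root = build_counted_suffix_trie("GATAGATA")
print(count_occurrences(root, "GATA"))  # 2, so "GATA" is a repeat
print(count_occurrences(root, "TAG"))   # 1, so "TAG" is not repeated
```

In this sketch a pattern is a repeat exactly when its counter is at least 2, and the lookup costs time proportional to the pattern length only.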
Original language: English
Journal: Theoretical Computer Science
DOI: 10.1016/j.tcs.2019.12.014
Publication status: Published - 2020

Fingerprint

Suffix Tree
DNA
Genes
Data structures
Trees (mathematics)
Bioinformatics
Data Structures
Genome
Strings
Entire
Gene
Concatenation
Tree Algorithms
Time Complexity

Cite this

@article{8d3b28fe912a48d1b2dc81db3a0968c2,
title = "Counter based suffix tree for DNA pattern repeats",
abstract = "In recent years, string datasets have grown exponentially, and so has the need to process them. Many of these datasets come from bioinformatics, since the characteristics of any living organism are encoded in its genes. Genes consist of nucleic bases, which in turn make up the entire genome: a genome is a concatenation of different types of nucleic bases. Extracting the information encoded in these sequences efficiently requires algorithms that can decode it. Most available methods use the data structure commonly referred to as the suffix tree. Suffix trees have evolved tremendously over the years, and on-line construction of the suffix tree is deemed the best approach; however, it is not optimal for finding repeated sequences because of the many traversals the algorithm must perform when identifying repeats. To improve the speed of finding repeats, we developed a counter based suffix tree algorithm. Our work presents a novel algorithm for constructing a counter based suffix tree without losing the properties of the suffix tree. The time complexity of the counter based suffix tree is Θ(n), where n is the length of the string, which is the same as that of the fastest suffix tree implementation. We have shown that the counter based suffix tree reduces the search time when identifying repeats. We have proved that a counter based suffix tree can be built during construction.",
author = "Gobonamang, {Tshepo Kitso} and Dimane Mpoeleng",
year = "2020",
doi = "10.1016/j.tcs.2019.12.014",
language = "English",
journal = "Theoretical Computer Science",
issn = "0304-3975",
publisher = "Elsevier",

}

TY - JOUR

T1 - Counter based suffix tree for DNA pattern repeats

AU - Gobonamang, Tshepo Kitso

AU - Mpoeleng, Dimane

PY - 2020

Y1 - 2020

N2 - In recent years, string datasets have grown exponentially, and so has the need to process them. Many of these datasets come from bioinformatics, since the characteristics of any living organism are encoded in its genes. Genes consist of nucleic bases, which in turn make up the entire genome: a genome is a concatenation of different types of nucleic bases. Extracting the information encoded in these sequences efficiently requires algorithms that can decode it. Most available methods use the data structure commonly referred to as the suffix tree. Suffix trees have evolved tremendously over the years, and on-line construction of the suffix tree is deemed the best approach; however, it is not optimal for finding repeated sequences because of the many traversals the algorithm must perform when identifying repeats. To improve the speed of finding repeats, we developed a counter based suffix tree algorithm. Our work presents a novel algorithm for constructing a counter based suffix tree without losing the properties of the suffix tree. The time complexity of the counter based suffix tree is Θ(n), where n is the length of the string, which is the same as that of the fastest suffix tree implementation. We have shown that the counter based suffix tree reduces the search time when identifying repeats. We have proved that a counter based suffix tree can be built during construction.

U2 - 10.1016/j.tcs.2019.12.014

DO - 10.1016/j.tcs.2019.12.014

M3 - Article

JO - Theoretical Computer Science

JF - Theoretical Computer Science

SN - 0304-3975

ER -