Researchers Develop Game-Changing Gene Prediction Algorithms
Posted July 3, 2024

 

Clockwise from left: Conceptual diagram depicting the data processing flow in the GeneMark-ETP algorithm; Dr. Tomas Bruna, co-first author on the GeneMark-ETP paper and co-author on the BRAKER3 paper; Dr. Mark Borodovsky, senior author on the GeneMark-ETP paper and co-senior author on the BRAKER3 paper; and Alex Lomsadze, co-first author on the GeneMark-ETP paper and co-author on the BRAKER3 paper.

 

Graphic illustration by Katya Kouznetsova.

 

 

A team of researchers led by Georgia Tech’s Mark Borodovsky has developed two groundbreaking algorithms for gene prediction that could eventually lead to new treatments for disease as well as clean-energy fuels. These algorithms can generate maps of gene locations in eukaryotic species, such as animals and plants.

“Gene prediction is crucial for bioscience and biomedicine,” said Borodovsky. “Accurate gene maps are vital for identifying genes associated with diseases and discovery of new drug targets.”

Algorithms for gene prediction are foundational tools that enable in-depth understanding of genetic information. Genomic maps of gene locations generated by algorithms are necessary for gaining insights into evolutionary processes and the genetics of traits.

“They are required for downstream studies of gene expression and regulation, leading to a better overall understanding of how genes contribute to the functioning of cells and organisms,” said Borodovsky, Regents Professor in the Wallace H. Coulter Department of Biomedical Engineering at Georgia Tech and Emory.

The research team describes the new algorithms in two papers published in the latest edition of the journal Genome Research. One algorithm is called GeneMark-ETP; the other, developed with colleagues from the University of Greifswald in Germany, is called BRAKER3.

“This is a milestone, to have two papers published in the same edition of a leading journal in genetics and genomics,” Borodovsky said.

 

Thirty Years of Discovery

Borodovsky’s lab has been developing the GeneMark family of algorithms since 1993, and these algorithms have become essential tools for the global community of researchers engaged in genome sequencing and annotation. Sequencing finds the order of nucleotides (A, T, C, and G) in a DNA molecule. Annotation determines what those nucleotides do and where genes are located in the genome, which is the first step toward understanding the roles and functions of genes.

For decades, the National Center for Biotechnology Information at NIH has used GeneMark tools, which have evolved over time. Initially, they were used to annotate prokaryotic species (bacteria and archaea). The team’s first tools for genomes of eukaryotic organisms were introduced in 2005 with the field’s first self-training algorithm. This branch of GeneMark has continued to improve and evolve through the years.

Now the lab has introduced GeneMark-ETP, a new automatic gene finder designed to annotate the most challenging genomes likely to be encountered in the Earth BioGenome Project, an international effort to sequence, catalog, and characterize all of Earth’s eukaryotic biodiversity.

GeneMark-ETP, which combines data from genomes, transcriptomes, and proteins, has significantly increased the accuracy of gene prediction in large eukaryotic genomes, as described in the Genome Research paper.

“GeneMark-ETP remains an unsupervised algorithm, eliminating the need for prior training data,” said Tomas Bruna, a former grad student in the Borodovsky lab (now a genome data scientist with the Department of Energy) and one of the lead authors of the paper, with senior research scientist Alexandre Lomsadze (a member of the lab since 1998). “This makes it an ideal foundation for more complex gene prediction pipelines, like BRAKER3.”

 

Algorithm Integration

Borodovsky has organized a computational gene discovery workshop at the annual Plant and Animal Genome conferences since 2009. That’s where he met Mario Stanke, a researcher at the University of Greifswald in Germany, whose team had been developing algorithms for gene annotation.

The two researchers soon began collaborating, and over the past 10 years they have worked together on the development of cutting-edge algorithms. Stanke and members of his research team at Greifswald are co-authors of the second paper in Genome Research, which focuses on BRAKER3.

“It has been a privilege to work with enthusiastic top-class scientists,” said Borodovsky, also a professor in Georgia Tech’s College of Computing.

BRAKER3 builds on the 10-year collaboration. It combines GeneMark-ETP with the German team’s AUGUSTUS algorithm and uses a previously developed algorithm called TSEBRA to merge the two and further improve accuracy.

The research team tested BRAKER3 on the genomes of 11 species — including those with large, complex genomes — and demonstrated that it outperformed the existing tools. BRAKER3 is available on GitHub.

Borodovsky said the new algorithms “significantly expand the capabilities of our gene-finding tools for eukaryotic genomes. We envision exciting future applications of these tools, especially as initiatives like the Earth BioGenome Project aim to gain a deeper understanding of the tree of life.”

 

CITATION: Tomáš Brůna, Alexandre Lomsadze, Mark Borodovsky. “GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes.” Genome Research

 

CITATION: Lars Gabriel, Tomáš Brůna, Katharina J. Hoff, Matthis Ebel, Alexandre Lomsadze, Mark Borodovsky, Mario Stanke. “BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA.” Genome Research

 

 

 

 

 

Contact

Jerry Grillo
Communications
Wallace H. Coulter Department of Biomedical Engineering
