By Mauro Rebelo

Genome reassembly


Genome sequencing is a milestone in the life history of a species. With the information accumulated over almost 100 years of molecular biology for different organisms, we can make many inferences for a given species simply by looking at the sequence of its DNA.

Assembling an organism’s genome for the first time is called ‘de novo’ assembly. The process begins by estimating the size of the genome using a technique called cytometry, which compares the organism’s DNA content with that of an organism whose genome size is already known, such as our own.
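
The comparison can be sketched as a simple proportion. This is a minimal illustration, assuming the cytometry readout is a fluorescence intensity from DNA-stained nuclei and that intensity scales linearly with DNA content. The function name and the numbers are hypothetical, not measurements from our projects.

```python
def estimate_genome_size(sample_signal, reference_signal, reference_genome_bp):
    """Scale the known reference genome size by the ratio of the two signals.

    Assumes both samples were stained and measured under the same conditions,
    so that signal is proportional to DNA content.
    """
    if reference_signal <= 0:
        raise ValueError("reference_signal must be positive")
    return reference_genome_bp * sample_signal / reference_signal


# Hypothetical readings: the sample peak sits at 52% of the reference peak,
# and the reference genome is taken to be about 3.1 billion base pairs.
estimated_bp = estimate_genome_size(52.0, 100.0, 3.1e9)
print(f"Estimated genome size: {estimated_bp / 1e9:.2f} Gbp")
```

In practice the estimate is only as good as the reference standard and the staining protocol, which is why it is treated as a starting point for planning rather than a final number.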

Once we have an estimate of the genome size, we can estimate how much sequencing the assembly will need. Sequencing is still a stochastic process: there is no way of predicting which part of the genome will be sequenced. To avoid ending up with too much of one part and nothing of another, we have to sequence a total amount of DNA equal to several times the size of the genome: between 10 and 100 times, depending on the sequencing technology.
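
The arithmetic behind this planning step can be sketched as follows. Coverage is the total number of bases sequenced divided by the genome size. The second function uses an idealized Poisson model, in which reads land uniformly at random, to approximate the fraction of the genome that no read touches. Real data deviate from that assumption, and the genome size, read length and coverage values are hypothetical.

```python
import math


def reads_needed(genome_size_bp, read_length_bp, target_coverage):
    """Number of reads so that reads * read_length = coverage * genome size."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


def expected_uncovered_fraction(coverage):
    """Idealized Poisson estimate of the fraction of bases hit by no read."""
    return math.exp(-coverage)


genome_size_bp = 1_600_000_000  # hypothetical estimate from cytometry
print(reads_needed(genome_size_bp, 150, 30))  # short reads at 30x coverage

for coverage in (1, 10, 30):
    fraction = expected_uncovered_fraction(coverage)
    print(f"{coverage}x coverage: about {fraction:.2e} of the genome unsequenced")
```

Under this model, sequencing only one genome's worth of DNA would still leave roughly a third of the bases unread, which is why the target is a multiple of the genome size.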

Sequencing can be done in short fragments (between 150 and 400 base pairs) or long ones (up to 10,000 base pairs). Either way, the first step is to build a ‘library’: the DNA, which is very long, is fragmented into pieces of different sizes, these fragments are circularized with the help of adapters (small DNA molecules with 10 known base pairs), and they are then fragmented again for sequencing.

Sequencing digitizes the chemical and biological information in DNA.

The assembly is done by algorithms that overlap the fragments based on the identity of the nitrogenous bases in the sequence. Assembling small fragments generates larger ones, which come together to form even larger fragments called scaffolds. Ideally, each scaffold is as long as one of the organism’s chromosomes. We repeat this process until the number of scaffolds equals the number of chromosomes in the organism. In this way, we digitally recover the original genome sequence.
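
The idea of merging fragments by their overlapping bases can be illustrated with a toy greedy assembler. This is a teaching sketch, not the method used by production assemblers, which must also handle sequencing errors, reverse complements and repeated regions. The function names, the fragments and the `min_overlap` threshold are all hypothetical.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(fragments, min_overlap=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                length = overlap_length(left, right, min_overlap)
                if length > best_len:
                    best_len, best_i, best_j = length, i, j
        if best_len == 0:
            break  # no remaining pair overlaps enough to be merged
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three overlapping fragments of the made-up sequence ATGGCGTACGTTAGC
fragments = ["TACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(fragments))  # ['ATGGCGTACGTTAGC']
```

When no pair overlaps by at least `min_overlap` bases, the loop stops and returns several separate pieces, which mirrors how a real assembly ends up with more scaffolds than chromosomes when coverage is insufficient.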

And then begins the most uncertain and laborious process: annotating the genome, that is, identifying which sequences are genes, promoters, transposons, micro- and macrosatellites, SNPs and… which sequences are nothing. We use machine learning here, training an algorithm to recognize patterns of nitrogenous bases that are associated with genes in other organisms and then mapping the candidate genes it finds onto our genome.
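
To give a feel for what "finding patterns associated with genes" means, here is a much simpler, rule-based stand-in: scanning for open reading frames, stretches that begin with a start codon (ATG) and end at a stop codon. This is not the machine-learning approach described above and it ignores introns, the reverse strand and everything that is not protein-coding. The function name, the `min_codons` threshold and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons before the stop, including the ATG.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # close the frame and look for the next ATG
    return orfs


toy_sequence = "CCATGGCTGAAACCTAAGG"
for start, end, frame in find_orfs(toy_sequence):
    print(frame, toy_sequence[start:end])  # 2 ATGGCTGAAACCTAA
```

A trained model replaces these hard-coded rules with patterns learned from genes already annotated in other organisms, but its output is of the same kind: coordinates of candidate genes that still need curation.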

And finally, through an extensive literature review, we manually curate this automatic annotation.

Eventually, we go to the bench to confirm in vitro that a gene has the function we expected it to have. This validation is more expensive and time-consuming, but it is the only validation that can confirm the function of a gene without a shadow of a doubt.

Reassembling a genome is still a major scientific achievement and the first step towards developing biotechnological solutions for the control and conservation of species or the creation of products and services for the bioeconomy. And that’s why we specialize in it.

These are some of the genomes we’ve sequenced:

Golden Mussel – Limnoperna fortunei

  • Marcela Uliano-Silva, Francesco Dondero, Thomas Dan Otto, Igor Costa, Nicholas Costa Barroso Lima, Juliana Alves Americo, Camila Junqueira Mazzoni, Francisco Prosdocimi, Mauro de Freitas Rebelo, A hybrid-hierarchical genome assembly strategy to sequence the invasive golden mussel, Limnoperna fortunei, GigaScience, Volume 7, Issue 2, February 2018, gix128, https://doi.org/10.1093/gigascience/gix128

White Pitch – Protium kleinii

  • Luana Ferreira Afonso, Danielle Amaral, Marcela Uliano-Silva, André Luiz Quintanilha Torres, Daniel Reis Simas, Mauro de Freitas Rebelo, First Draft Genome of a Brazilian Atlantic Rainforest Burseraceae reveals commercially-promising genes involved in terpenic oleoresins synthesis, bioRxiv 467720, https://doi.org/10.1101/467720

Sun Coral – Tubastraea tagusensis, Tubastraea coccinea and Tubastraea sp.

  • Giordano Bruno Soares-Souza, Danielle Amaral, André Q. Torres, Daniela Batista, Aline Silva Romão-Dumaresq, Luciana Leomil, Marcela Uliano-Silva, Francesco Dondero, Mauro de Freitas Rebelo, Draft genome of the invasive coral Tubastraea sp., bioRxiv 756999, https://doi.org/10.1101/756999

  • Giordano Bruno Soares-Souza, Danielle Amaral, Daniela Batista, André Q. Torres, Anna Carolini Silva Serra, Marcela Uliano-Silva, Luciana Leomil, Aryane Camos Reis, Elyabe Monteiro de Matos, Emiliano Calderon, Vriko Yu, Francesco Dondero, Saulo Marçal de Sousa, David Baker, Aline Dumaresq, Mauro F. Rebelo, The genomes of invasive coral Tubastraea spp. (Dendrophylliidae) as tool for the development of biotechnological solutions, bioRxiv 2020.04.24.060574, https://doi.org/10.1101/2020.04.24.060574