The complexity of an organism emerges from its genome, the book in which DNA holds the instructions for life. Sequencing, the method for reading this book, has evolved towards reading ever longer fragments of the genome.
In this field, a research group led by the Institute for Integrative Systems Biology (I2SysBio), a joint center of the Spanish National Research Council (CSIC) and the University of Valencia (UV) in Spain, has published in the journal Nature Methods an improved version of its own computer program. The software can discover new transcripts, the RNA molecules that genes use to synthesize proteins and build tissues, from data produced by long-read sequencing instruments, and can also assign them a function in the formation of the organism.
Long-read sequencing is the third generation of genome sequencing methods. Whereas short-read sequencing analyzes fragments of about 200 nucleotides (the letters that make up genes), long-read methods can obtain reads 100 times longer, about 20,000 nucleotides, which leaves fewer gaps in the genome information to fill in with bioinformatics tools. This was one of the reasons why Nature Methods named it Method of the Year 2022.
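The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two read lengths are the approximate figures quoted above, the 6,000-nucleotide transcript is a hypothetical example, and the function names are invented for this sketch. It ignores read overlap, sequencing errors and uneven coverage.

```python
import math

SHORT_READ_LEN = 200      # approximate short-read length quoted in the article
LONG_READ_LEN = 20_000    # approximate long-read length quoted in the article


def min_reads_to_span(molecule_len: int, read_len: int) -> int:
    """Smallest number of end-to-end reads needed to cover a molecule once."""
    if molecule_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")
    return math.ceil(molecule_len / read_len)


def joins_to_infer(molecule_len: int, read_len: int) -> int:
    """Number of joins between consecutive reads that software must infer."""
    return min_reads_to_span(molecule_len, read_len) - 1


# Hypothetical transcript length, chosen only for illustration.
transcript_len = 6_000

for label, read_len in (("short", SHORT_READ_LEN), ("long", LONG_READ_LEN)):
    reads = min_reads_to_span(transcript_len, read_len)
    joins = joins_to_infer(transcript_len, read_len)
    print(f"{label} reads: {reads} read(s), {joins} join(s) to infer")
```

In this idealized case the short-read approach needs at least 30 fragments and 29 inferred joins to reconstruct the molecule, while a single long read can span it from end to end.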
A few years earlier, in 2018, researcher Ana Conesa, then at the University of Florida, developed a computer program called SQANTI to analyze the information extracted with these long-read methods. Now her research team at I2SysBio, which includes Francisco J. Pardo-Palacios, presents a substantial improvement to this software, which can be used free of charge with data from the main commercial long-read sequencing systems, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT).
“Long-read techniques better analyze the complexity of human transcripts and of the transcriptome,” says Conesa. The transcriptome is the portion of the genome that is read in each cell to give rise to tissues and organs. A single gene can give rise, through small changes in the structure of the RNA it encodes, to a great diversity of transcripts, and with them to proteins with different cellular functions. “Short-read sequencing cannot solve this puzzle. Long-read sequencing better reconstructs the functional complexity of the human transcriptome, something key to studying certain diseases, especially neurological diseases and cancer,” says the CSIC researcher.
Artistic recreation of DNA. (Illustration: Amazings/NCYT)
A better understanding of the complexity of the body and of disease
The version published now, SQANTI3, addresses some earlier problems, arising from RNA degradation or from the analysis of each molecule individually, and introduces notable improvements. The program can now discover new transcripts that were not in the genome databases used by this kind of software. Furthermore, through artificial intelligence techniques, the software can assign functional information to each new transcript, “something essential to understand the functional complexity of the organism and of diseases,” highlights Conesa.
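The idea of spotting a transcript that is absent from a reference database can be illustrated with a toy Python sketch. It is not SQANTI3's actual algorithm or code: the reference annotation, the coordinates, the category labels and the `classify` function are all hypothetical, and each transcript is reduced to its chain of splice junctions (the points where segments of the RNA are joined).

```python
from typing import Dict, Tuple

Junction = Tuple[int, int]            # (donor position, acceptor position)
JunctionChain = Tuple[Junction, ...]  # ordered junctions of one transcript

# Hypothetical reference annotation: one gene with two known transcripts.
REFERENCE: Dict[str, JunctionChain] = {
    "tx_A": ((100, 200), (300, 400), (500, 600)),
    "tx_B": ((100, 200), (500, 600)),
}


def classify(chain: JunctionChain, reference: Dict[str, JunctionChain]) -> str:
    """Label a sequenced transcript relative to the reference (toy categories)."""
    if not chain:
        return "mono-exon: no junctions to compare"
    if chain in reference.values():
        return "known"
    # Collect every junction that appears in any reference transcript.
    known_junctions = {j for ref_chain in reference.values() for j in ref_chain}
    if all(j in known_junctions for j in chain):
        return "novel: new combination of known junctions"
    return "novel: contains a junction absent from the reference"


# Hypothetical long reads, already reduced to junction chains.
reads = [
    ((100, 200), (500, 600)),   # matches tx_B exactly
    ((100, 200), (300, 400)),   # known junctions, unseen combination
    ((100, 250), (500, 600)),   # first junction is not in the reference
]
for chain in reads:
    print(chain, "->", classify(chain, REFERENCE))
```

Because a long read can cover a whole transcript, its full junction chain is observed directly in one molecule, which is what makes a comparison of this kind possible in principle.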
The program was developed on I2SysBio's Garnatxa computing cluster, which has 15 computing nodes offering 950 parallel computing threads. In addition, the Gene Expression Genomics group led by Ana Conesa at I2SysBio participates in ELIXIR, one of the strategic infrastructures of the European Strategy Forum on Research Infrastructures (ESFRI), which allows life sciences laboratories across Europe to share and store their data.
The University of Florida and Pacific Biosciences, one of the companies that markets long-read sequencing technology through its PacBio system, collaborated in the development of SQANTI3. The company recommends the Spanish software for analyzing its data. The program is free to use, and it already has “thousands of users around the world,” according to Conesa, although “the success of this tool also requires more technical staff to respond to the numerous requests we receive.” In this context, the researcher has co-led the recent launch of the CSIC Connection of Computational Biology and Bioinformatics, a platform to connect people, methods and resources in these fields at the CSIC. (Source: Isidoro García / CSIC)