2024
Barbero-Aparicio, José A.; Olivares-Gil, Alicia; Rodríguez, Juan J.; García-Osorio, César; Díez-Pastor, José F.
Addressing data scarcity in protein fitness landscape analysis: A study on semi-supervised and deep transfer learning techniques Journal Article
In: Information Fusion, vol. 102, pp. 102035, 2024, ISSN: 1566-2535.
@article{barbero-aparicio2023b,
title = {Addressing data scarcity in protein fitness landscape analysis: A study on semi-supervised and deep transfer learning techniques},
author = {José A. Barbero-Aparicio and Alicia Olivares-Gil and Juan J. Rodríguez and César García-Osorio and José F. Díez-Pastor},
url = {https://www.sciencedirect.com/science/article/pii/S1566253523003512},
doi = {10.1016/j.inffus.2023.102035},
issn = {1566-2535},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
journal = {Information Fusion},
volume = {102},
pages = {102035},
abstract = {This paper presents a comprehensive analysis of deep transfer learning methods, supervised methods, and semi-supervised methods in the context of protein fitness prediction, with a focus on small datasets. The analysis includes the exploration of the combination of different data sources to enhance the performance of the models. While deep learning and deep transfer learning methods have shown remarkable performance in situations with abundant data, this study aims to address the more realistic scenario faced by wet lab researchers, where labeled data is often limited. The novelty of this work lies in its examination of deep transfer learning in the context of small datasets and its consideration of semi-supervised methods and multi-view strategies. While previous research has extensively explored deep transfer learning in large dataset scenarios, little attention has been given to its efficacy in small dataset settings or its comparison with semi-supervised approaches. Our findings suggest that deep transfer learning, exemplified by ProteinBERT, shows promising performance compared to the other methods across various evaluation metrics, not only in small dataset contexts but also in large dataset scenarios. This highlights the robustness and versatility of deep transfer learning in protein fitness prediction tasks, even with limited labeled data. The results of this study shed light on the potential of deep transfer learning as a state-of-the-art approach in the field of protein fitness prediction. By leveraging pre-trained models and fine-tuning them on small datasets, researchers can achieve competitive performance, surpassing traditional supervised and semi-supervised methods. These findings provide valuable insights for wet lab researchers who face the challenge of limited labeled data, enabling them to make informed decisions when selecting the most effective methodology for their specific protein fitness prediction tasks. Additionally, the study investigated the combination of two different sources of information (encodings) through our enhanced semi-supervised methods, yielding noteworthy results that improve on their base model and provide valuable insights for further research. The presented analysis contributes to a better understanding of the capabilities and limitations of different learning approaches in small dataset scenarios, ultimately aiding in the development of improved protein fitness prediction methods.},
keywords = {bioinformatics, Machine learning, Protein fitness prediction, Semi-supervised learning, Small datasets, Transfer learning},
pubstate = {published},
tppubtype = {article}
}
2023
Barbero-Aparicio, José A.; Olivares-Gil, Alicia; Díez-Pastor, José F.; García-Osorio, César
Deep learning and support vector machines for transcription start site identification Journal Article
In: PeerJ Computer Science, vol. 9, iss. e1340, 2023, ISSN: 2376-5992.
@article{barbero-aparicio2023,
title = {Deep learning and support vector machines for transcription start site identification},
author = {José A. Barbero-Aparicio and Alicia Olivares-Gil and José F. Díez-Pastor and César García-Osorio},
editor = {Carlos Fernandez-Lozano},
url = {https://doi.org/10.7717/peerj-cs.1340},
doi = {10.7717/peerj-cs.1340},
issn = {2376-5992},
year = {2023},
date = {2023-04-17},
urldate = {2023-04-17},
journal = {PeerJ Computer Science},
volume = {9},
number = {e1340},
abstract = {Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems, such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have proven to be exceptionally effective for this task, but their use in transcription start site identification has not yet been explored in depth. Moreover, the very few existing works neither compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The small number of published papers on this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied to related problems with remarkable results, we compared their performance on transcription start site prediction, concluding that SVMs are computationally much slower and that deep learning methods, especially long short-term memory neural networks (LSTMs), are better suited to working with sequences than SVMs. For this purpose, we used the reference human genome GRCh38. Additionally, we studied two different aspects related to data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models was also tested on the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices in transcription start site identification, as well as a method to generate transcription start site datasets, including negative instances, for any species available in Ensembl. We found that deep learning methods are better suited than SVMs to solving this problem, being more efficient and better adapted to long sequences and large amounts of data. We also created a transcription start site (TSS) dataset large enough to be used in deep learning experiments.},
keywords = {bioinformatics, Convolutional neural network, Deep learning, Long short-term memory, Machine learning, Support vector machines, transcription start site},
pubstate = {published},
tppubtype = {article}
}