Machine learning with scarcely labeled data for Industry 4.0
(Ref: PID2020-119894GB-I00)

Project summary

Although the benefits of AI are increasingly accepted, its adoption in industry is far from optimal. Some AI techniques, such as fuzzy logic, have found their place, while more recent ones have not yet achieved sufficient penetration. In this project we focus on problems in the manufacturing industry for which there is not enough labeled data: classical methods fail there, so more recent machine learning methods are needed.

Within AI, machine learning focuses on creating algorithms that “learn” by analyzing historical data, discovering hidden trends and patterns that support decision-making. Traditionally, a distinction is made between supervised and unsupervised learning. Supervised learning requires recorded data in which both the values of the input variables and the value of the output variable to be predicted are known. If the output variable takes its value from a finite set of possible values, the problem is one of classification; if the output is a continuous value, it is a regression problem. The result of the learning process is a model capable of predicting the output value for new input values. This is very useful in an industrial context, because the predicted output could be, for example, the time remaining before the next failure (which would help decide when to carry out preventive maintenance) or the roughness of a part to be machined (which would allow the machining parameters to be adjusted to obtain the desired roughness). In unsupervised learning, on the other hand, there is no variable to predict; instead, the aim is to discover other types of patterns, such as similarities between the data or the fact that a set of measurements is an outlier (which could indicate a machine malfunction).
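
The three settings described above can be illustrated with a minimal sketch in plain Python. It is not code from the project: the function names, the numbers and the "ok"/"defect" labels are hypothetical and chosen only for illustration. Regression is shown as a least-squares line, classification as a 1-nearest-neighbour rule, and unsupervised outlier detection as a simple z-score check.

```python
import statistics


def fit_line(xs, ys):
    """Supervised regression: ordinary least squares for y = a * x + b."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


def nearest_neighbor_label(train_x, train_y, x):
    """Supervised classification: label of the closest training example."""
    return min(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[1]


def flag_outliers(values, threshold=3.0):
    """Unsupervised check: indices of values whose z-score exceeds the threshold."""
    mean, std = statistics.fmean(values), statistics.pstdev(values)
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


if __name__ == "__main__":
    # Hypothetical data: a machining parameter (input) and measured roughness (output).
    slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print("predicted roughness for input 5:", slope * 5 + intercept)

    # Hypothetical labeled parts: roughness value -> quality label.
    print(nearest_neighbor_label([0.5, 1.0, 3.0, 3.5], ["ok", "ok", "defect", "defect"], 2.9))

    # Hypothetical sensor readings with one abnormal measurement at the end.
    readings = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 9.0]
    print("outlier indices:", flag_outliers(readings))
```

The first two functions need a known output for every training example, whereas the outlier check works on the measurements alone.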

The problem with supervised learning is that the historical data often has to be labeled manually, which makes obtaining training data very time-consuming and expensive. To alleviate this issue, semi-supervised learning methods have emerged: starting from a limited set of labeled data, they also exploit the structure present in the unlabeled data to learn more robust models than could be obtained from the small labeled set alone.
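
One well-known semi-supervised strategy is self-training, sketched below in plain Python. This is an illustrative example under stated assumptions, not one of the algorithms developed in the project: the base learner is a simple nearest-centroid classifier, the confidence measure (`margin`) and the `min_margin` threshold are hypothetical choices, and the data are synthetic two-dimensional blobs. Starting from one labeled example per class, the model repeatedly labels the unlabeled points it is most confident about and retrains on the enlarged labeled set.

```python
import math
import random


def fit_centroids(X, y):
    """Nearest-centroid model: one mean vector per class label."""
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    return {
        label: [sum(p[i] for p in pts) / len(pts) for i in range(len(pts[0]))]
        for label, pts in by_class.items()
    }


def predict_with_margin(model, x):
    """Return (label, margin); margin is the distance gap between the two closest centroids."""
    dists = sorted((math.dist(x, c), label) for label, c in model.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def self_training(X_lab, y_lab, X_unlab, min_margin=1.0, max_rounds=10):
    """Iteratively pseudo-label confident unlabeled points and refit the model."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    model = fit_centroids(X_lab, y_lab)
    for _ in range(max_rounds):
        confident, rest = [], []
        for x in pool:
            label, margin = predict_with_margin(model, x)
            (confident if margin >= min_margin else rest).append((x, label))
        if not confident:
            break  # nothing new to learn from
        for x, label in confident:
            X_lab.append(x)
            y_lab.append(label)
        pool = [x for x, _ in rest]
        model = fit_centroids(X_lab, y_lab)
    return model


def demo(seed=0):
    """Accuracy on synthetic test data after self-training from two labeled points."""
    rng = random.Random(seed)

    def blob(cx, cy, n):
        return [[rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)] for _ in range(n)]

    # Hypothetical setup: one labeled example per class, 200 unlabeled points.
    X_lab, y_lab = [[0.0, 1.0], [5.0, 4.0]], [0, 1]
    X_unlab = blob(0, 0, 100) + blob(5, 5, 100)
    X_test, y_test = blob(0, 0, 100) + blob(5, 5, 100), [0] * 100 + [1] * 100
    model = self_training(X_lab, y_lab, X_unlab)
    hits = sum(predict_with_margin(model, x)[0] == y for x, y in zip(X_test, y_test))
    return hits / len(y_test)


if __name__ == "__main__":
    print("test accuracy:", demo())
```

Only points whose margin reaches `min_margin` are pseudo-labeled, which limits the risk of reinforcing early mistakes; a stricter threshold adds fewer but more reliable pseudo-labels.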

Our group has its origins in the design of classification algorithms. In different research projects we have adapted them to other types of problems, such as regression problems (continuous output variable) and multi-label and multi-output problems (where several output variables are learned simultaneously, also exploiting the relationships that may exist between them). In this project we propose to adapt these algorithms to problems in which the amount of labeled data is limited; that is, we want to design new semi-supervised learning algorithms and use them to solve industrial problems.

EyeVR dataset (available to download)

As part of the project, a dataset (EyeVR) was created for classifying user performance in the operation of cranes in industrial contexts in virtual reality.

This dataset contains (a) tags categorising the user's performance while using the system and (b) measurements of variables collected by the eye-tracking sensors integrated in the virtual reality headsets while the participants performed the proposed tasks.

The proposed exercises, 11 in total and of varying difficulty, were carried out over 3 different sessions.
More details about them are provided in the metadata folder, in our previous research publications and in the data descriptor accompanying this dataset. The dataset includes data from a total of 71 participants across the three experiences.

More information on the dataset and the experiences carried out to collect data can be found in our research works and the dataset documentation:

Serrano-Mamolar, A., Miguel-Alonso, I., Checa, D., & Pardo-Aguilar, C. (2023). Towards learner performance evaluation in iVR learning environments using eye-tracking and Machine-learning. Comunicar, 31(76), 9–20. Retrieved from https://doi.org/10.3916/C76-2023-01
Ramírez-Sanz, J.M., Peña-Alonso, H.M., Serrano-Mamolar, A., Arnaiz-González, Á., Bustillo, A. (2023). Detection of Stress Stimuli in Learning Contexts of iVR Environments. In: De Paolis, L.T., Arpaia, P., Sacco, M. (eds) Extended Reality. XR Salento 2023. Lecture Notes in Computer Science, vol 14219. Springer, Cham. https://doi.org/10.1007/978-3-031-43404-4_29

Link to download the dataset

Publications

2024

Garrido-Labrador, José Luis; Serrano-Mamolar, Ana; Maudes-Raedo, Jesús; Rodríguez, Juan J.; García-Osorio, César

Ensemble methods and semi-supervised learning for information fusion: A review and future research directions Journal Article

In: Information Fusion, vol. 107, 2024, ISSN: 1566-2535.

Kuncheva, Ludmila I.; Garrido-Labrador, José Luis; Ramos-Pérez, Ismael; Hennessey, Samuel L.; Rodríguez, Juan J.

Semi-supervised classification with pairwise constraints: A case study on animal identification from video Journal Article

In: Information Fusion, vol. 104, 2024, ISSN: 1566-2535.

Ramos-Pérez, Ismael; Barbero-Aparicio, José Antonio; Canepa-Oneto, Antonio; Arnaiz-González, Álvar; Maudes-Raedo, Jesús

An Extensive Performance Comparison between Feature Reduction and Feature Selection Preprocessing Algorithms on Imbalanced Wide Data Journal Article

In: Information, vol. 15, no. 4, 2024, ISSN: 2078-2489.

Maestro-Prieto, Jose Alberto; Ramírez-Sanz, José Miguel; Bustillo, Andrés; Rodriguez-Díez, Juan José

Semi-supervised diagnosis of wind-turbine gearbox misalignment and imbalance faults Journal Article

In: Applied Intelligence, 2024, ISSN: 1573-7497.

@article{Maestro-Prieto2024,
  title = {Semi-supervised diagnosis of wind-turbine gearbox misalignment and imbalance faults},
  author = {Jose Alberto Maestro-Prieto and José Miguel Ramírez-Sanz and Andrés Bustillo and Juan José Rodriguez-Díez},
  url = {https://doi.org/10.1007/s10489-024-05373-6},
  doi = {10.1007/s10489-024-05373-6},
  issn = {1573-7497},
  year = {2024},
  date = {2024-03-28},
  urldate = {2024-03-28},
  journal = {Applied Intelligence},
  abstract = {Both wear-induced bearing failure and misalignment of the powertrain between the rotor and the electrical generator are common failure modes in wind-turbine motors. In this study, Semi-Supervised Learning (SSL) is applied to a fault detection and diagnosis solution. Firstly, a dataset is generated containing both normal operating patterns and seven different failure classes of the two aforementioned failure modes that vary in intensity. Several datasets are then generated, maintaining different numbers of labeled instances and unlabeling the others, in order to evaluate the number of labeled instances needed for the desired accuracy level. Subsequently, different types of SSL algorithms and combinations of algorithms are trained and then evaluated with the test data. The results showed that an SSL approach could improve the accuracy of trained classifiers when a small number of labeled instances were used together with many unlabeled instances to train a Co-Training algorithm or combinations of such algorithms. When a few labeled instances (fewer than 10% or 327 instances, in this case) were used together with unlabeled instances, the SSL algorithms outperformed the result obtained with the Supervised Learning (SL) techniques used as a benchmark. When the number of labeled instances was sufficient, the SL algorithm (using only labeled instances) performed better than the SSL algorithms (accuracy levels of 87.04% vs. 86.45%, when labeling 10% of instances). A competitive accuracy of 97.73% was achieved with the SL algorithm processing a subset of 40% of the labeled instances.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}

Martin-Melero, Íñigo; Serrano-Mamolar, Ana; Rodríguez-Diez, Juan J.

Evaluation of Semi-Supervised Machine Learning applied to Affective State Detection Proceedings Article

In: IEEE, 2024.

Barbero-Aparicio, José A.; Olivares-Gil, Alicia; Rodríguez, Juan J.; García-Osorio, César; Díez-Pastor, José F.

Addressing data scarcity in protein fitness landscape analysis: A study on semi-supervised and deep transfer learning techniques Journal Article

In: Information Fusion, vol. 102, pp. 102035, 2024, ISSN: 1566-2535.

@article{barbero-aparicio2023b,
  title = {Addressing data scarcity in protein fitness landscape analysis: A study on semi-supervised and deep transfer learning techniques},
  author = {José A. Barbero-Aparicio and Alicia Olivares-Gil and Juan J. Rodríguez and César García-Osorio and José F. Díez-Pastor},
  url = {https://www.sciencedirect.com/science/article/pii/S1566253523003512},
  doi = {10.1016/j.inffus.2023.102035},
  issn = {1566-2535},
  year = {2024},
  date = {2024-01-01},
  urldate = {2024-01-01},
  journal = {Information Fusion},
  volume = {102},
  pages = {102035},
  abstract = {This paper presents a comprehensive analysis of deep transfer learning methods, supervised methods, and semi-supervised methods in the context of protein fitness prediction, with a focus on small datasets. The analysis includes the exploration of the combination of different data sources to enhance the performance of the models. While deep learning and deep transfer learning methods have shown remarkable performance in situations with abundant data, this study aims to address the more realistic scenario faced by wet lab researchers, where labeled data is often limited. The novelty of this work lies in its examination of deep transfer learning in the context of small datasets and its consideration of semi-supervised methods and multi-view strategies. While previous research has extensively explored deep transfer learning in large dataset scenarios, little attention has been given to its efficacy in small dataset settings or its comparison with semi-supervised approaches. Our findings suggest that deep transfer learning, exemplified by ProteinBERT, shows promising performance in this context compared to the rest of the methods across various evaluation metrics, not only in small dataset contexts but also in large dataset scenarios. This highlights the robustness and versatility of deep transfer learning in protein fitness prediction tasks, even with limited labeled data. The results of this study shed light on the potential of deep transfer learning as a state-of-the-art approach in the field of protein fitness prediction. By leveraging pre-trained models and fine-tuning them on small datasets, researchers can achieve competitive performance surpassing traditional supervised and semi-supervised methods. These findings provide valuable insights for wet lab researchers who face the challenge of limited labeled data, enabling them to make informed decisions when selecting the most effective methodology for their specific protein fitness prediction tasks. Additionally, the study investigated the combination of two different sources of information (encodings) through our enhanced semi-supervised methods, yielding noteworthy results improving their base model and providing valuable insights for further research. The presented analysis contributes to a better understanding of the capabilities and limitations of different learning approaches in small dataset scenarios, ultimately aiding in the development of improved protein fitness prediction methods},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}

2023

Ramírez-Sanz, José Miguel; Maestro-Prieto, Jose-Alberto; Arnaiz-González, Álvar; Bustillo, Andrés

Semi-supervised learning for industrial fault detection and diagnosis: A systemic review Journal Article

In: ISA Transactions, vol. 143, pp. 255–270, 2023, ISSN: 0019-0578.

Mena-Alonso, Álvaro; Latorre-Carmona, Pedro; González, Dorys C.; Díez-Pastor, José F.; Rodríguez, Juan J.; Mínguez, Jesús; Vicente, Miguel A.

A cost-effective stereo camera-based system for measuring crack propagation in fibre-reinforced concrete Journal Article

In: Archiv.Civ.Mech.Eng, vol. 23, no. 3, 2023, ISSN: 2083-3318.

Kuncheva, Ludmila I.; Garrido-Labrador, José Luis; Ramos-Pérez, Ismael; Hennessey, Samuel L.; Rodríguez, Juan J.

An experiment on animal re-identification from video Journal Article

In: Ecological Informatics, vol. 74, 2023, ISSN: 1574-9541.

Barbero-Aparicio, José A.; Olivares-Gil, Alicia; Díez-Pastor, José F.; García-Osorio, César

Deep learning and support vector machines for transcription start site identification Journal Article

In: PeerJ Computer Science, vol. 9, iss. e1340, 2023, ISSN: 2376-5992.

@article{barbero-aparicio2023,
  title = {Deep learning and support vector machines for transcription start site identification},
  author = {José A. Barbero-Aparicio and Alicia Olivares-Gil and José F. Díez-Pastor and César García-Osorio},
  editor = {Carlos Fernandez-Lozano},
  url = {https://doi.org/10.7717/peerj-cs.1340},
  doi = {10.7717/peerj-cs.1340},
  issn = {2376-5992},
  year = {2023},
  date = {2023-04-17},
  urldate = {2023-04-17},
  journal = {PeerJ Computer Science},
  volume = {9},
  issue = {e1340},
  abstract = {Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have been proven to be exceptionally effective for this task, but their use in transcription start site identification has not yet been explored in depth. Also, the very few existing works do not compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The reduced amount of published papers in this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied in related problems with remarkable results, we compared their performance in transcription start site predictions, concluding that SVMs are computationally much slower, and deep learning methods, specially long short-term memory neural networks (LSTMs), are best suited to work with sequences than SVMs. For such a purpose, we used the reference human genome GRCh38. Additionally, we studied two different aspects related to data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models studied was also tested using the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices in transcription start site identification, as well as a method to generate transcription start site datasets including negative instances on any species available in Ensembl. We found that deep learning methods are better suited than SVMs to solve this problem, being more efficient and better adapted to long sequences and large amounts of data. We also create a transcription start site (TSS) dataset large enough to be used in deep learning experiments},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}

Setó-Rey, Daniel; Santos-Martín, José Ignacio; López-Nozal, Carlos

Vulnerability of Package Dependency Networks Journal Article

In: IEEE Trans. Netw. Sci. Eng., pp. 1–13, 2023, ISSN: 2327-4697.

2022

Pimenov, Danil Yurievich; Bustillo, Andrés; Wojciechowski, Szymon; Sharma, Vishal Santosh; Gupta, Munish Kumar; Kuntoğlu, Mustafa

Artificial intelligence systems for tool condition monitoring in machining: analysis and critical review Journal Article

In: Journal of Intelligent Manufacturing, vol. 2022, 2022, ISSN: 0956-5515.

@article{Pimenov2022,
  title = {Artificial intelligence systems for tool condition monitoring in machining: analysis and critical review},
  author = {Danil Yurievich Pimenov and Andrés Bustillo and Szymon Wojciechowski and Vishal Santosh Sharma and Munish Kumar Gupta and Mustafa Kuntoğlu},
  url = {https://link.springer.com/article/10.1007/s10845-022-01923-2#citeas},
  doi = {10.1007/s10845-022-01923-2},
  issn = {0956-5515},
  year = {2022},
  date = {2022-03-12},
  urldate = {2022-03-12},
  journal = {Journal of Intelligent Manufacturing},
  volume = {2022},
  abstract = {The wear of cutting tools, cutting force determination, surface roughness variations and other machining responses are of keen interest to latest researchers. The variations of these machining responses results in change in dimensional accuracy and productivity upto great extent. In addition, an excessive increase in wear leads to catastrophic consequences, exceeding the tool breakage. Therefore, this article discusses the online trend of modern approaches in tool condition monitoring while different machining operations. For this purpose, the effective use of new sensors and artificial intelligence (AI) is considered and followed during this holistic review work. The sensor systems used for monitoring tool wear are dynamometers, accelerometers, acoustic emission sensors, current and power sensors, image sensors, other sensors. These systems allow to solve the problem of automation and modeling of technological parameters of the main types of cutting, such as turning, milling, drilling and grinding. The modern artificial intelligence methods are considered, such as: Neural networks, Image recognition, Fuzzy logic, Adaptive neuro-fuzzy inference systems, Bayesian Networks, Support vector machine, Ensembles, Decision and regression trees, k-nearest neighbors, Artificial Neural Network, Markov model, Singular Spectrum Analysis, Genetic algorithms. Discussions also includes the main advantages, disadvantages and prospects of using various AI methods for tool wear monitoring. Moreover, the problems and future directions of the main processing methods using AI models are also highlighted.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}

Ramos-Pérez, Ismael; Arnaiz-González, Álvar; Rodríguez, Juan José; García-Osorio, César

When is resampling beneficial for feature selection with imbalanced wide data? Journal Article

In: Expert Systems with Applications, vol. 188, pp. 116015, 2022, ISSN: 0957-4174.

Olivares-Gil, Alicia; Arnaiz-Rodríguez, Adrián; Ramírez-Sanz, José Miguel; Garrido-Labrador, José Luis; Ahedo, Virginia; García-Osorio, César; Santos, José Ignacio; Galán, José Manuel

Mapping the scientific structure of organization and management of enterprises using complex networks Journal Article

In: Int. J. Prod. Manag. Eng., vol. 10, no. 1, pp. 65–76, 2022, ISSN: 2340-4876.

Cruz, David Checa; Urbikain, Gorka; Beranoagirre, Aitor; Bustillo, Andrés; Lacalle, Luis Norberto López

Using Machine-Learning techniques and Virtual Reality to design cutting tools for energy optimization in milling operations Journal Article

In: International Journal of Computer Integrated Manufacturing, vol. 35, no. 1, pp. 1-21, 2022, ISSN: 0951-192X.

@article{Cruz2022b,
  title = {Using Machine-Learning techniques and Virtual Reality to design cutting tools for energy optimization in milling operations},
  author = {David Checa Cruz and Gorka Urbikain and Aitor Beranoagirre and Andrés Bustillo and Luis Norberto López Lacalle},
  url = {https://www.tandfonline.com/doi/full/10.1080/0951192X.2022.2027020},
  doi = {10.1080/0951192X.2022.2027020},
  issn = {0951-192X},
  year = {2022},
  date = {2022-01-19},
  urldate = {2022-01-19},
  journal = {International Journal of Computer Integrated Manufacturing},
  volume = {35},
  number = {1},
  pages = {1-21},
  abstract = {The selection of a proper cutting tool in machining operations is a critical issue. Tool geometric parameters are essential for milling performance. However, the process engineer has very limited experience of the best parameter combination, due to the high cost of cutting tool tests. The same holds true for bachelor studies on machining processes. This study proposes a new strategy that combines experimental tests, machine-learning modelling and Virtual Reality visualization to overcome these limitations. First, tools with different geometric parameters are tested. Second, the experimental data are modeled with different machine-learning techniques (regression trees, multilayer perceptrons, bagging and random forest ensembles). An in-depth analysis of the influence of each input on model accuracy is performed to reduce experimental costs. The results show that the best model with no cutting-force inputs performed worse than the best model with all the inputs. Third, the most accurate model is used to build 3D graphs of special interest to engineering students as well as process engineers, for the optimization of power consumption under different cutting conditions. Finally, a Virtual Reality environment is presented to train engineering students in the study of the best tool design and cutting parameter optimization.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}

2021

Díez-Pastor, José Francisco; Latorre-Carmona, Pedro; Garrido-Labrador, José Luis; Ramírez-Sanz, José Miguel; Rodríguez, Juan J.

Experimental Assessment of Feature Extraction Techniques Applied to the Identification of Properties of Common Objects, Using a Radar System Journal Article

In: Applied Sciences, vol. 11, no. 15, 2021, ISSN: 2076-3417.
