The following is a short list of selected publications. To see the full list, go to the Publications page.
Ramos-Pérez, Ismael; Arnaiz-González, Álvar; Rodríguez, Juan José; García-Osorio, César
In: Expert Systems with Applications, vol. 188, pp. 116015, 2022, ISSN: 0957-4174.
This paper studies the effects that combinations of balancing and feature selection techniques have on wide data (many more attributes than instances) when different classifiers are used. For this, an extensive study is done using 14 datasets, 3 balancing strategies, and 7 feature selection algorithms. The evaluation is carried out using 5 classification algorithms, analyzing the results for different percentages of selected features, and establishing the statistical significance using Bayesian tests.
Some general conclusions of the study are that it is better to use RUS before the feature selection, while ROS and SMOTE offer better results when applied afterwards. Additionally, specific results are obtained depending on the classifier used. For example, for Gaussian SVM the best performance is obtained when the feature selection is done with SVM-RFE before balancing the data with RUS.
Cruz, David Checa; Saucedo-Dorantes, Juan José; Ríos, Roque Alfredo Osorno; Antonio-Daviu, José Alfonso; Bustillo, Andrés
In: Applied Sciences, vol. 12, no. 1, pp. 414, 2022, ISSN: 2076-3417.
The incorporation of new technologies as training methods, such as virtual reality (VR), facilitates instruction when compared to traditional approaches, which have shown strong limitations in their ability to engage young students who have grown up in the smartphone culture of continuous entertainment. Moreover, not all educational centers or organizations are able to incorporate specialized labs or equipment for training and instruction. Using VR applications, it is possible to reproduce training programs with a high rate of similarity to real programs, filling the gap in traditional training. In addition, it reduces unnecessary investment and prevents economic losses, avoiding unnecessary damage to laboratory equipment. The contribution of this work focuses on the development of a VR-based teaching and training application for the condition-based maintenance of induction motors. The novelty of this research relies mainly on the use of natural interactions with the VR environment and the design’s optimization of the VR application in terms of the proposed teaching topics. The application is comprised of two training modules. The first module is focused on the main components of induction motors, the assembly of workbenches and familiarization with induction motor components. The second module employs motor current signature analysis (MCSA) to detect induction motor failures, such as broken rotor bars, misalignments, unbalances, and gradual wear on gear case teeth. Finally, the usability of this VR tool has been validated with both graduate and undergraduate students, assuring the suitability of this tool for: (1) learning basic knowledge and (2) training in practical skills related to the condition-based maintenance of induction motors.
Juez-Gil, Mario; Arnaiz-González, Álvar; Rodríguez, Juan José; López-Nozal, Carlos; García-Osorio, César
Rotation Forest for Big Data Journal Article
In: Information Fusion, vol. 74, pp. 39-49, 2021, ISSN: 1566-2535.
The Rotation Forest classifier is a successful ensemble method for a wide variety of data mining applications. However, the way in which Rotation Forest transforms the feature space through PCA, although powerful, penalizes training and prediction times, making it unfeasible for Big Data. In this paper, a MapReduce Rotation Forest and its implementation under the Spark framework are presented. The proposed MapReduce Rotation Forest behaves in the same way as the standard Rotation Forest, training the base classifiers on a rotated space, but using a functional implementation of the rotation that enables its execution in Big Data frameworks. Experimental results are obtained using different cloud-based cluster configurations. Bayesian tests are used to validate the method against two ensembles for Big Data: Random Forest and PCARDE classifiers. Our proposal incorporates the parallelization of both the PCA calculation and the tree training, providing a scalable solution that retains the performance of the original Rotation Forest and achieves a competitive execution time (on average, at training, more than 3 times faster than other PCA-based alternatives). In addition, extensive experimentation shows that by setting some parameters of the classifier (i.e., bootstrap sample size, number of trees, and number of rotations), the execution time is reduced with no significant loss of performance using a small ensemble.
Rodríguez, Juan José; Juez-Gil, Mario; López-Nozal, Carlos; Arnaiz-González, Álvar
Rotation Forest for multi-target regression Journal Article
In: International Journal of Machine Learning and Cybernetics, 2021, ISSN: 1868-808X.
The prediction of multiple numeric outputs at the same time is called multi-target regression (MTR), and it has gained attention during the last decades. This task is a challenging research topic in supervised learning because it poses additional difficulties to traditional single-target regression (STR), and many real-world problems involve the prediction of multiple targets at once. One of the most successful approaches to deal with MTR, although not the only one, consists in transforming the problem into several STR problems, whose outputs are then combined to build the MTR output. In this paper, the Rotation Forest ensemble method, previously proposed for single-label classification and single-target regression, is adapted to MTR tasks and tested with several regressors and data sets. Our proposal rotates the input space in an efficient and novel fashion, avoiding extra rotations forced by MTR problem decomposition. Four approaches for MTR are used: single-target (ST), stacked-single target (SST), Ensembles of Regressor Chains (ERC), and Multi-target Regression via Quantization (MRQ). To assess the benefits of the proposal, a thorough experimental study with 28 MTR data sets and statistical tests is carried out, concluding that Rotation Forest, adapted by means of these approaches, outperforms other popular ensembles, such as Bagging and Random Forest.
Díez-Pastor, José Francisco; del Val, Alain Gil; Veiga, Fernando; Bustillo, Andrés
In: Measurement, vol. 168, no. 108328, 2021, ISSN: 0263-2241.
Industrial threading processes that use cutting taps are in high demand. However, industrial conditions differ markedly from laboratory conditions. In this study, a machine-learning solution is presented for the correct classification of threads, based on industrial requirements, to avoid expensive manual measurement of quality indicators. First, quality states are categorized. Second, process inputs are extracted from the torque signals including statistical parameters. Third, different machine-learning algorithms are tested: from base classifiers, such as decision trees and multilayer perceptrons, to complex ensembles of classifiers especially designed for imbalanced datasets, such as boosting and bagging decision-tree ensembles combined with SMOTE and under-sampling balancing techniques. Ensembles demonstrated the lowest sensitivity to window sizes, the highest accuracy for smaller window sizes, and the greatest learning ability with small datasets. Fourth, the combination of models with both high Recall and high Precision resulted in a reliable industrial tool, tested on an extensive experimental dataset.
Juez-Gil, Mario; Saucedo-Dorantes, Juan José; Arnaiz-González, Álvar; López-Nozal, Carlos; García-Osorio, César; Lowe, David
In: ISA Transactions, vol. 106, pp. 367-381, 2020, ISSN: 0019-0578.
The detection of faulty machinery and its automated diagnosis is an industrial priority because efficient fault diagnosis implies efficient management of the maintenance times, reduction of energy consumption, reduction in overall costs and, most importantly, ensured availability of the machinery. Thus, this paper presents a new intelligent multi-fault diagnosis method based on multiple sensor information for assessing the occurrence of single, combined, and simultaneous faulty conditions in an induction motor. The contribution and novelty of the proposed method include the consideration of different physical magnitudes such as vibrations, stator currents, voltages, and rotational speed as a meaningful source of information of the machine condition. Moreover, for each available physical magnitude, the reduction of the original number of attributes through Principal Component Analysis retains a reduced number of significant features, which allows the final diagnosis outcome to be achieved by a multi-label classification tree. The effectiveness of the method was validated by using a complete set of experimental data acquired from a laboratory electromechanical system, where a healthy and seven faulty scenarios were assessed. Also, the interpretation of the results does not require any prior expert knowledge and the robustness of this proposal allows its use in industrial applications, since it can deal with different operating conditions such as different loads and operating frequencies. Finally, the performance was evaluated using multi-label measures, which, to the best of our knowledge, is an innovative development in the field of condition monitoring and fault identification.
Rodríguez, Juan José; Juez-Gil, Mario; Arnaiz-González, Álvar; Kuncheva, Ludmila I
An experimental evaluation of mixup regression forests Journal Article
In: Expert Systems with Applications, vol. 151, no. 113376, 2020, ISSN: 0957-4174.
Over the past few decades, the remarkable prediction capabilities of ensemble methods have been used within a wide range of applications. Maximization of base-model ensemble accuracy and diversity are the keys to the heightened performance of these methods. One way to achieve diversity for training the base models is to generate artificial/synthetic instances for their incorporation with the original instances. Recently, the mixup method was proposed for improving the classification power of deep neural networks (Zhang, Cissé, Dauphin, and Lopez-Paz, 2017). The mixup method generates artificial instances by combining pairs of instances and their labels. These new instances are used for training the neural networks, promoting their regularization. In this paper, new regression tree ensembles trained with mixup, which we will refer to as Mixup Regression Forest, are presented and tested. The experimental study with 61 datasets showed that the mixup approach improved the results of both Random Forest and Rotation Forest.
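The pairwise combination described in this abstract can be illustrated with a minimal sketch. It assumes the mixing coefficient is drawn from a Beta(alpha, alpha) distribution, as in the original mixup proposal; the function name `mixup_regression`, the value of `alpha`, and the toy data are hypothetical and are not taken from the paper.

```python
import random

def mixup_regression(X, y, n_new, alpha=0.4, seed=0):
    """Create n_new synthetic (features, target) pairs by mixing random pairs of instances.

    Assumption: lam ~ Beta(alpha, alpha); the paper may use different settings.
    """
    rng = random.Random(seed)
    X_new, y_new = [], []
    for _ in range(n_new):
        i, j = rng.randrange(len(X)), rng.randrange(len(X))
        lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
        # Convex combination of the two feature vectors and of their targets.
        X_new.append([lam * a + (1 - lam) * b for a, b in zip(X[i], X[j])])
        y_new.append(lam * y[i] + (1 - lam) * y[j])
    return X_new, y_new

# Toy data where the target is a linear function of the first feature.
X = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
y = [10.0, 20.0, 30.0]
X_aug, y_aug = mixup_regression(X, y, n_new=5)
```

In an ensemble, each base regressor could be trained on the original instances plus a differently seeded synthetic set, which is one way such instances might contribute diversity.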
Rodríguez, Juan José; Díez-Pastor, José Francisco; Arnaiz-González, Álvar; Kuncheva, Ludmila I
Random Balance ensembles for multiclass imbalance learning Journal Article
In: Knowledge-Based Systems, 2020, ISSN: 0950-7051.
The Random Balance strategy (RandBal) has been recently proposed for constructing classifier ensembles for imbalanced, two-class data sets. In RandBal, each base classifier is trained with a sample of the data with a random class prevalence, independent of the a priori distribution. Hence, for each sample, one of the classes will be undersampled while the other will be oversampled. RandBal can be applied on its own or can be combined with any other ensemble method. One particularly successful variant is RandBalBoost which integrates Random Balance and boosting. Encouraged by the success of RandBal, this work proposes two approaches which extend RandBal to multiclass imbalance problems. Multiclass imbalance implies that at least two classes have substantially different proportions of instances. In the first approach proposed here, termed Multiple Random Balance (MultiRandBal), we deal with all classes simultaneously. The training data for each base classifier are sampled with random class proportions. The second approach we propose decomposes the multiclass problem into two-class problems using one-vs-one or one-vs-all, and builds an ensemble of RandBal ensembles. We call the two versions of the second approach OVO-RandBal and OVA-RandBal, respectively. These two approaches were chosen because they are the most straightforward extensions of RandBal for multiple classes. Our main objective is to evaluate both approaches for multiclass imbalanced problems. To this end, an experiment was carried out with 52 multiclass data sets. The results suggest that both MultiRandBal and OVO/OVA-RandBal are viable extensions of the original two-class RandBal. Collectively, they consistently outperform acclaimed state-of-the-art methods for multiclass imbalanced problems.
Checa, David; Bustillo, Andrés
In: Multimedia Tools and Applications, pp. 1-21, 2019, ISSN: 1380-7501.
The merger of game-based approaches and Virtual Reality (VR) environments that can enhance learning and training methodologies has a very promising future, reinforced by the widespread market-availability of affordable software and hardware tools for VR-environments. Rather than passive observers, users engage in those learning environments as active participants, permitting the development of exploration-based learning paradigms. There are separate reviews of VR technologies and serious games for educational and training purposes with a focus on only one knowledge area. However, this review covers 135 proposals for serious games in immersive VR-environments that are combinations of both VR and serious games and that offer end-user validation. First, an analysis of the forum, nationality, and date of publication of the articles is conducted. Then, the application domains, the target audience, the design of the game and its technological implementation, the performance evaluation procedure, and the results are analyzed. The aim here is to identify the factual standards of the proposed solutions and the differences between training and learning applications. Finally, the study lays the basis for future research lines that will develop serious games in immersive VR-environments, providing recommendations for the improvement of these tools and their successful application for the enhancement of both learning and training tasks.
Kordos, Mirosław; Arnaiz-González, Álvar; García-Osorio, César
Evolutionary prototype selection for multi-output regression Journal Article
In: Neurocomputing, vol. 358, pp. 309-320, 2019, ISSN: 0925-2312.
A novel approach to prototype selection for multi-output regression data sets is presented. A multi-objective evolutionary algorithm is used to evaluate the selections using two criteria: training data set compression and prediction quality expressed in terms of root mean squared error. A multi-target regressor based on k-NN was used for that purpose during the training to evaluate the error, while the tests were performed using four different multi-target predictive models. The distance matrices used by the multi-target regressor were cached to accelerate operational performance. Multiple Pareto fronts were also used to prevent overfitting and to obtain a broader range of solutions, by using different probabilities in the initialization of populations and different evolutionary parameters in each one. The results obtained with the benchmark data sets showed that the proposed method greatly reduced data set size and, at the same time, improved the predictive capabilities of the multi-output regressors trained on the reduced data set.
Faithfull, William J; Rodríguez, Juan José; Kuncheva, Ludmila I
In: Information Fusion, vol. 45, pp. 202-214, 2019, ISSN: 1566-2535.
Detecting change in multivariate data is a challenging problem, especially when class labels are not available. There is a large body of research on univariate change detection, notably in control charts developed originally for engineering applications. We evaluate univariate change detection approaches (including those in the MOA framework) built into ensembles where each member observes a feature in the input space of an unsupervised change detection problem. We present a comparison between the ensemble combinations and three established ‘pure’ multivariate approaches over 96 data sets, and a case study on the KDD Cup 1999 network intrusion detection dataset. We found that ensemble combination of univariate methods consistently outperformed multivariate methods on the four experimental metrics.
Kuncheva, Ludmila I; Rodríguez, Juan José
On feature selection protocols for very low-sample-size data Journal Article
In: Pattern Recognition, vol. 81, pp. 660-673, 2018, ISSN: 0031-3203.
High-dimensional data with very few instances are typical in many application domains. Selecting a highly discriminative subset of the original features is often the main interest of the end user. The widely-used feature selection protocol for this type of data consists of two steps. First, features are selected from the data (possibly through cross-validation), and, second, a cross-validation protocol is applied to test a classifier using the selected features. The selected feature set and the testing accuracy are then returned to the user. For lack of a better option, the same low-sample-size dataset is used in both steps. Questioning the validity of this protocol, we carried out an experiment using 24 high-dimensional datasets, three feature selection methods and five classifier models. We found that the accuracy returned by the above protocol is heavily biased, and therefore propose an alternative protocol which avoids the contamination by including both steps in a single cross-validation loop. Statistical tests verify that the classification accuracy returned by the proper protocol is significantly closer to the true accuracy (estimated from an independent testing set) compared to that returned by the currently favoured protocol.
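The single-loop protocol described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the feature score (absolute difference of class means), the nearest-centroid classifier, and the names `select_features`, `predict`, and `cross_validate` are hypothetical stand-ins, not the selection methods or classifier models used in the paper.

```python
import random

def select_features(X, y, k):
    """Rank features by absolute difference of class means; return the top k indices."""
    def mean(values):
        return sum(values) / len(values)
    scored = []
    for f in range(len(X[0])):
        m0 = mean([row[f] for row, label in zip(X, y) if label == 0])
        m1 = mean([row[f] for row, label in zip(X, y) if label == 1])
        scored.append((abs(m0 - m1), f))
    scored.sort(reverse=True)
    return [f for _, f in scored[:k]]

def predict(X_train, y_train, row, feats):
    """Nearest-centroid prediction restricted to the selected features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        members = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [sum(r[f] for r in members) / len(members) for f in feats]
        dist = sum((row[f] - c) ** 2 for f, c in zip(feats, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cross_validate(X, y, k_features, n_folds=5):
    """Selection and testing share one loop: features come from the training folds only."""
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        feats = select_features(X_tr, y_tr, k_features)  # never sees the test fold
        correct += sum(predict(X_tr, y_tr, X[i], feats) == y[i] for i in test_idx)
    return correct / len(X)

# Pure-noise data: 20 instances, 200 features, labels unrelated to the features.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(20)]
y = [i % 2 for i in range(20)]
accuracy = cross_validate(X, y, k_features=5)
```

Calling `select_features` once on all 20 instances before the loop would reproduce the two-step protocol that the abstract questions, since the test folds would then influence which features are kept.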
Arnaiz-González, Álvar; Díez-Pastor, José Francisco; Rodríguez, Juan José; García-Osorio, César
Local sets for multi-label instance selection Journal Article
In: Applied Soft Computing, vol. 68, pp. 651-666, 2018, ISSN: 1568-4946.
The multi-label classification problem is an extension of traditional (single-label) classification, in which the output is a vector of values rather than a single categorical value. The multi-label problem is therefore a very different and much more challenging one than the single-label problem. Recently, multi-label classification has attracted interest, because of its real-life applications, such as image recognition, bio-informatics, and text categorization, among others. Unfortunately, there are few instance selection techniques capable of processing the data used for these applications. These techniques are also very useful for cleaning and reducing the size of data sets.
In single-label problems, the local set of an instance x comprises all instances in the largest hypersphere centered on x, so that they are all of the same class. This concept has been successfully integrated in the design of Iterative Case Filtering, one of the most influential instance selection methods in single-label learning. Unfortunately, the concept that was originally defined for single-label learning cannot be directly applied to multi-label data, as each instance has more than one label.
An adaptation of the local set concept to multi-label data is proposed in this paper and its effectiveness is verified in the design of two new algorithms that yielded competitive results. One of the adaptations cleans the data sets, to improve their predictive capabilities, while the other aims to reduce data set sizes. Both are tested and compared against the state-of-the-art instance selection methods available for multi-label learning.
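For reference, the single-label local set described in this abstract can be computed with a few lines of code. This minimal sketch covers only the single-label definition, not the multi-label adaptation proposed in the paper. The name `local_set` is hypothetical, Euclidean distance is assumed, and the instance itself is excluded from its own local set, which is a choice made here for illustration.

```python
import math

def local_set(X, y, i):
    """Indices of instances closer to X[i] than its nearest enemy.

    The nearest enemy is the closest instance of a different class; its distance
    is the radius of the largest hypersphere that contains only same-class
    instances. Requires at least one instance of another class.
    """
    dist = [math.dist(X[i], x) for x in X]
    enemy_radius = min(d for d, label in zip(dist, y) if label != y[i])
    return [j for j, d in enumerate(dist) if j != i and d < enemy_radius]

# One-dimensional toy data: three instances of class "a", two of class "b".
X = [[0.0], [1.0], [2.0], [5.0], [6.0]]
y = ["a", "a", "a", "b", "b"]
print(local_set(X, y, 0))  # [1, 2]
print(local_set(X, y, 3))  # [4]
```

With multi-label data there is no single class to compare against, which is why this definition does not carry over directly.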
Kuncheva, Ludmila I; Rodríguez, Juan José; Jackson, Aaron S
Restricted set classification: Who is there? Journal Article
In: Pattern Recognition, vol. 63, pp. 158-170, 2017, ISSN: 0031-3203.
We consider a problem where a set X of N objects (instances) coming from c classes have to be classified simultaneously. A restriction is imposed on X in that the maximum possible number of objects from each class is known, hence we dubbed the problem who-is-there? We compare three approaches to this problem: (1) independent classification whereby each object is labelled in the class with the largest posterior probability; (2) a greedy approach which enforces the restriction; and (3) a theoretical approach which, in addition, maximises the likelihood of the label assignment, implemented through the Hungarian assignment algorithm. Our experimental study consists of two parts. The first part includes a custom-made chess data set where the pieces on the chess board must be recognised together from an image of the board. In the second part, we simulate the restricted set classification scenario using 96 datasets from a recently collated repository (University of Santiago de Compostela, USC). Our results show that the proposed approach (3) outperforms approaches (1) and (2).
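The difference between approaches (1) and (2) can be illustrated with a small sketch. The greedy rule used here (visit object-class pairs from the most to the least confident posterior, and assign while the class still has room) is an assumption about how such a restriction might be enforced; the abstract does not specify the paper's exact procedure. The names `independent_labels` and `greedy_restricted_labels` are hypothetical.

```python
def independent_labels(posteriors):
    """Approach (1): each object takes the class with the largest posterior."""
    return [max(range(len(row)), key=row.__getitem__) for row in posteriors]

def greedy_restricted_labels(posteriors, capacity):
    """Greedy labelling that respects a maximum number of objects per class.

    posteriors[i][c] is the posterior of class c for object i;
    capacity[c] is the maximum number of objects allowed in class c.
    """
    remaining = list(capacity)
    labels = [None] * len(posteriors)
    # Visit (posterior, object, class) triples from most to least confident.
    pairs = sorted(
        ((p, i, c) for i, row in enumerate(posteriors) for c, p in enumerate(row)),
        key=lambda t: -t[0],
    )
    for p, i, c in pairs:
        if labels[i] is None and remaining[c] > 0:
            labels[i] = c
            remaining[c] -= 1
    return labels

posteriors = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4]]
capacity = [1, 2]  # at most one object of class 0, at most two of class 1
print(independent_labels(posteriors))                  # [0, 0, 0]
print(greedy_restricted_labels(posteriors, capacity))  # [0, 1, 1]
```

Approach (3) would replace the greedy loop with an optimal assignment, for which the abstract names the Hungarian algorithm.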
Sáiz-Manzanares, María Consuelo; Marticorena-Sánchez, Raúl; García-Osorio, César; Díez-Pastor, José Francisco
In: Frontiers in Psychology, vol. 8, pp. 745, 2017, ISSN: 1664-1078.
Learning Management System (LMS) platforms provide a wealth of information on the learning patterns of students. Learning Analytics (LA) techniques permit the analysis of the logs or records of the activities of both students and teachers on the on-line platform. The learning patterns differ depending on the type of Blended Learning (B-Learning). In this study, we analyse: 1) whether significant differences exist between the learning outcomes of students and their learning patterns on the platform, depending on the type of B-Learning [Replacement blend (RB) vs. Supplemental blend (SB)]; 2) whether a relation exists between the metacognitive and the motivational strategies of students, their learning outcomes and their learning patterns on the platform. The 87,065 log records of 129 students (69 in RB and 60 in SB) in the Moodle 3.1 platform were analysed. The results revealed different learning patterns between students depending on the type of B-Learning (RB vs. SB). We have found that the degree of blend, RB vs. SB, seems to condition student behaviour on the platform. Learning patterns in RB environments can predict student learning outcomes. Additionally, in RB environments there is a relationship between the learning patterns and the metacognitive and motivational strategies of the students.
Arnaiz-González, Álvar; Blachnik, Marcin; Kordos, Mirosław; García-Osorio, César
Fusion of instance selection methods in regression tasks Journal Article
In: Information Fusion, vol. 30, pp. 69-79, 2016, ISSN: 1566-2535.
Data pre-processing is a very important aspect of data mining. In this paper we discuss instance selection used for prediction algorithms, which is one of the pre-processing approaches. The purpose of instance selection is to improve the data quality by data size reduction and noise elimination. Until recently, instance selection has been applied mainly to classification problems. Very few recent papers address instance selection for regression tasks. This paper proposes fusion of instance selection algorithms for regression tasks to improve the selection performance. As members of the ensemble, two different families of instance selection methods are evaluated: one based on a distance threshold and the other on converting the regression task into a multiple-class classification task. Extensive experimental evaluation performed on the two regression versions of the Edited Nearest Neighbor (ENN) and Condensed Nearest Neighbor (CNN) methods showed that the best performance, measured by the error value and data size reduction, is in most cases obtained by the ensemble methods.
Arnaiz-González, Álvar; Díez-Pastor, José Francisco; Rodríguez, Juan José; García-Osorio, César
Instance selection of linear complexity for big data Journal Article
In: Knowledge-Based Systems, vol. 107, pp. 83-95, 2016, ISSN: 0950-7051.
Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large-to-massive data sets. In this paper, two new algorithms with linear complexity for instance selection purposes are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n²), or log-linear, O(n log n)) means that they are unable to process large-sized data sets, the new proposal shows competitive results in terms of accuracy. Even more remarkably, it shortens execution time, as the proposal manages to reduce complexity and make it linear with respect to the data set size. The new proposal has been compared with some of the best known instance selection methods for testing and has also been evaluated on large data sets (up to a million instances).
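To show how hashing can make instance selection linear in the number of instances, here is a minimal one-pass sketch. It is an illustration under assumptions, not the paper's algorithms: it uses random-projection hash functions and keeps the first instance of each class seen in each bucket. The name `lsh_instance_selection` and the parameters `n_projections` and `width` are hypothetical.

```python
import random

def lsh_instance_selection(X, y, n_projections=4, width=1.0, seed=0):
    """Return indices of retained instances after a single pass over the data.

    Assumption: instances that share a bucket (same quantized random projections)
    are treated as similar, and one instance per class is kept per bucket.
    """
    rng = random.Random(seed)
    dim = len(X[0])
    # Each hash function is a random direction plus a random offset.
    hashes = [([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, width))
              for _ in range(n_projections)]
    seen = set()
    selected = []
    for i, (row, label) in enumerate(zip(X, y)):
        bucket = tuple(
            int((sum(a * v for a, v in zip(direction, row)) + offset) // width)
            for direction, offset in hashes
        )
        if (bucket, label) not in seen:  # first instance of this class in this bucket
            seen.add((bucket, label))
            selected.append(i)
    return selected

# Identical points always share a bucket, so only one per class is kept.
X = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
y = [0, 0, 1, 1]
print(lsh_instance_selection(X, y))  # [0, 2]
```

Each instance is hashed once and looked up in a set, so the cost grows linearly with the number of instances for a fixed number of hash functions.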
Díez-Pastor, José Francisco; Rodríguez, Juan José; García-Osorio, César; Kuncheva, Ludmila I
In: Knowledge-Based Systems, vol. 85, pp. 96-111, 2015, ISSN: 0950-7051.
In Machine Learning, a data set is imbalanced when the class proportions are highly skewed. Class-imbalanced data sets arise routinely in many application domains and pose a challenge to traditional classifiers. We propose a new approach to building ensembles of classifiers for two-class imbalanced data sets, called Random Balance. Each member of the Random Balance ensemble is trained with data sampled from the training set and augmented by artificial instances obtained using SMOTE. The novelty in the approach is that the proportions of the classes for each ensemble member are chosen randomly. The intuition behind the method is that the proposed diversity heuristic will ensure that the ensemble contains classifiers that are specialized for different operating points on the ROC space, thereby leading to larger AUC compared to other ensembles of classifiers. Experiments have been carried out to test the Random Balance approach by itself, and also in combination with standard ensemble methods. As a result, we propose a new ensemble creation method called RB-Boost which combines Random Balance with AdaBoost.M2. This combination involves enforcing random class proportions in addition to instance re-weighting. Experiments with 86 imbalanced data sets from two well-known repositories demonstrate the advantage of the Random Balance approach.
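The random class proportions described in this abstract can be sketched as follows. This is a simplified illustration: the class that has to grow is oversampled by plain duplication, which is a stand-in for the SMOTE step named in the abstract, and the bounds on the random class size (between 2 and n - 2) are an assumption. The function name `random_balance_sample` is hypothetical.

```python
import random

def random_balance_sample(X, y, seed=0):
    """Resample a two-class data set (labels 0 and 1) to a random class proportion.

    The total size is preserved: one class is undersampled and the other is
    oversampled. Oversampling here duplicates instances (stand-in for SMOTE).
    """
    rng = random.Random(seed)
    n = len(y)
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    n0 = rng.randint(2, n - 2)  # random new size for class 0 (assumed bounds)
    n1 = n - n0

    def resize(indices, size):
        if size <= len(indices):
            return rng.sample(indices, size)  # undersample without replacement
        extra = [rng.choice(indices) for _ in range(size - len(indices))]
        return indices + extra  # oversample by duplication

    chosen = resize(idx0, n0) + resize(idx1, n1)
    return [X[i] for i in chosen], [y[i] for i in chosen]

# Imbalanced toy data: 8 instances of class 0 and 2 of class 1.
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_rb, y_rb = random_balance_sample(X, y, seed=3)
```

Calling the function with a different seed for each ensemble member yields a different class proportion per member, which is the source of diversity the abstract refers to.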
Díez-Pastor, José Francisco; Rodríguez, Juan José; García-Osorio, César; Kuncheva, Ludmila I
In: Information Sciences, vol. 325, pp. 98-117, 2015, ISSN: 0020-0255.
Many real-life problems can be described as unbalanced, where the number of instances belonging to one of the classes is much larger than the numbers in other classes. Examples are spam detection, credit card fraud detection or medical diagnosis. Ensembles of classifiers have acquired popularity for this kind of problem because of their ability to obtain better results than individual classifiers. The techniques most commonly used by ensembles especially designed to deal with imbalanced problems include Re-weighting, Oversampling and Undersampling. Other techniques, originally intended to increase the ensemble diversity, have not been systematically studied for their effect on imbalanced problems. Among these are Random Oracles, Disturbing Neighbors, Random Feature Weights or Rotation Forest. This paper presents an overview and an experimental study of various ensemble-based methods for imbalanced problems. The methods have been tested in their original form and in conjunction with several diversity-increasing techniques, using 84 imbalanced data sets from two well-known repositories. This paper shows that these diversity-increasing techniques significantly improve the performance of ensemble methods for imbalanced problems and provides some ideas about when it is more convenient to use these diversifying techniques.