2020
Juez-Gil, Mario; Saucedo-Dorantes, Juan José; Arnaiz-González, Álvar; López-Nozal, Carlos; García-Osorio, César; Lowe, David: Early and extremely early multi-label fault diagnosis in induction motors. ISA Transactions, 106, pp. 367-381, 2020, ISSN: 0019-0578.
@article{Juez-Gil2020,
title = {Early and extremely early multi-label fault diagnosis in induction motors},
author = {Mario Juez-Gil and Juan José Saucedo-Dorantes and Álvar Arnaiz-González and Carlos López-Nozal and César García-Osorio and David Lowe},
url = {https://www.sciencedirect.com/science/article/pii/S0019057820302755},
doi = {10.1016/j.isatra.2020.07.002},
issn = {0019-0578},
year = {2020},
date = {2020-11-01},
journal = {ISA Transactions},
volume = {106},
pages = {367-381},
abstract = {The detection of faulty machinery and its automated diagnosis is an industrial priority because efficient fault diagnosis implies efficient management of maintenance times, reduced energy consumption, lower overall costs and, most importantly, ensured availability of the machinery. Thus, this paper presents a new intelligent multi-fault diagnosis method based on multiple sensor information for assessing the occurrence of single, combined, and simultaneous faulty conditions in an induction motor. The contribution and novelty of the proposed method include the consideration of different physical magnitudes, such as vibrations, stator currents, voltages, and rotational speed, as meaningful sources of information on the machine condition. Moreover, for each available physical magnitude, Principal Component Analysis reduces the original set of attributes to a small number of significant features, from which a multi-label classification tree produces the final diagnosis outcome. The effectiveness of the method was validated using a complete set of experimental data acquired from a laboratory electromechanical system, where a healthy scenario and seven faulty scenarios were assessed. Also, the interpretation of the results does not require any prior expert knowledge, and the robustness of this proposal allows its use in industrial applications, since it can deal with different operating conditions such as different loads and operating frequencies. Finally, the performance was evaluated using multi-label measures which, to the best of our knowledge, is an innovative development in the field of condition monitoring and fault identification.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
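As a rough illustration of the pipeline this abstract describes (per-magnitude PCA feeding a multi-label classification tree), the sketch below uses scikit-learn. The array shapes, the number of retained components, and the fault labels are placeholder assumptions, not the authors' configuration.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

# Placeholder feature matrices, one per physical magnitude
# (vibration, stator current, voltage, rotational speed).
rng = np.random.default_rng(0)
magnitudes = [rng.normal(size=(200, 40)) for _ in range(4)]
# Hypothetical multi-label targets: one binary column per fault type.
Y = rng.integers(0, 2, size=(200, 3))

# Reduce each magnitude separately with PCA, then concatenate
# the retained components into a single feature vector.
reduced = [PCA(n_components=5).fit_transform(X) for X in magnitudes]
X_all = np.hstack(reduced)

# scikit-learn decision trees accept multi-output (multi-label)
# targets natively, predicting one value per label column.
clf = DecisionTreeClassifier(random_state=0).fit(X_all, Y)
print(clf.predict(X_all[:2]))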
2019
Kordos, Mirosław; Arnaiz-González, Álvar; García-Osorio, César: Evolutionary prototype selection for multi-output regression. Neurocomputing, 358, pp. 309-320, 2019, ISSN: 0925-2312.
@article{Kordos2019,
title = {Evolutionary prototype selection for multi-output regression},
author = {Mirosław Kordos and Álvar Arnaiz-González and César García-Osorio},
url = {https://www.sciencedirect.com/science/article/pii/S0925231219307611},
doi = {10.1016/j.neucom.2019.05.055},
issn = {0925-2312},
year = {2019},
date = {2019-09-17},
journal = {Neurocomputing},
volume = {358},
pages = {309-320},
abstract = {A novel approach to prototype selection for multi-output regression data sets is presented. A multi-objective evolutionary algorithm is used to evaluate the selections using two criteria: training data set compression and prediction quality expressed in terms of root mean squared error. A multi-target regressor based on k-NN was used for that purpose during the training to evaluate the error, while the tests were performed using four different multi-target predictive models. The distance matrices used by the multi-target regressor were cached to accelerate operational performance. Multiple Pareto fronts were also used to prevent overfitting and to obtain a broader range of solutions, by using different probabilities in the initialization of populations and different evolutionary parameters in each one. The results obtained with the benchmark data sets showed that the proposed method greatly reduced data set size and, at the same time, improved the predictive capabilities of the multi-output regressors trained on the reduced data set.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
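The selection scheme can be pictured with a toy evolutionary loop. The paper uses a multi-objective algorithm with multiple Pareto fronts; the sketch below scalarizes the two criteria (k-NN RMSE and retained fraction) into a single fitness for brevity, so it illustrates the idea rather than reproducing the authors' method.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
Y = rng.normal(size=(300, 2))              # multi-output targets

def fitness(mask, alpha=0.5):
    # Scalarized stand-in for the paper's two objectives: RMSE of a
    # k-NN multi-target regressor, plus the retained fraction
    # (the complement of compression).
    if mask.sum() < 5:
        return np.inf
    knn = KNeighborsRegressor(n_neighbors=3).fit(X[mask], Y[mask])
    rmse = np.sqrt(np.mean((knn.predict(X) - Y) ** 2))
    return alpha * rmse + (1 - alpha) * mask.mean()

# Tiny (mu + lambda)-style loop with bit-flip mutation only.
pop = [rng.random(len(X)) < 0.5 for _ in range(20)]
for _ in range(30):
    children = []
    for mask in pop:
        child = mask.copy()
        flips = rng.random(len(X)) < 0.02  # flip ~2% of the bits
        child[flips] = ~child[flips]
        children.append(child)
    pop = sorted(pop + children, key=fitness)[:20]

best = pop[0]
print(f"kept {best.sum()} of {len(X)} instances")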
Faithfull, William J; Rodríguez, Juan José; Kuncheva, Ludmila I: Combining univariate approaches for ensemble change detection in multivariate data. Information Fusion, 45, pp. 202-214, 2019, ISSN: 1566-2535.
@article{Faithfull2019,
title = {Combining univariate approaches for ensemble change detection in multivariate data},
author = {William J Faithfull and Juan José Rodríguez and Ludmila I Kuncheva},
url = {https://www.sciencedirect.com/science/article/pii/S1566253517301239},
doi = {10.1016/j.inffus.2018.02.003},
issn = {1566-2535},
year = {2019},
date = {2019-01-01},
journal = {Information Fusion},
volume = {45},
pages = {202-214},
abstract = {Detecting change in multivariate data is a challenging problem, especially when class labels are not available. There is a large body of research on univariate change detection, notably in control charts developed originally for engineering applications. We evaluate univariate change detection approaches — including those in the MOA framework — built into ensembles where each member observes a feature in the input space of an unsupervised change detection problem. We present a comparison between the ensemble combinations and three established ‘pure’ multivariate approaches over 96 data sets, and a case study on the KDD Cup 1999 network intrusion detection dataset. We found that ensemble combination of univariate methods consistently outperformed multivariate methods on the four experimental metrics.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
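The ensemble construction itself is simple to sketch: one univariate detector per input feature, combined by voting. The detector below is a deliberately naive mean-shift test standing in for the control-chart and MOA detectors the paper evaluates; the z threshold and vote ratio are illustrative assumptions.

import numpy as np

def univariate_alarm(ref_col, win_col, z=3.0):
    # Naive stand-in for one univariate detector: alarm when the
    # window mean drifts z standard errors from the reference mean.
    mu, sd = ref_col.mean(), ref_col.std() + 1e-12
    se = sd / np.sqrt(len(win_col))
    return abs(win_col.mean() - mu) / se > z

def ensemble_change(reference, window, vote_ratio=0.5):
    # One detector per feature; signal change when the proportion
    # of alarming members exceeds vote_ratio.
    alarms = [univariate_alarm(reference[:, j], window[:, j])
              for j in range(reference.shape[1])]
    return np.mean(alarms) > vote_ratio

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 10))
window = rng.normal(loc=0.8, size=(100, 10))   # simulated change
print(ensemble_change(reference, window))       # expected: True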
2018
Kuncheva, Ludmila I; Rodríguez, Juan José: On feature selection protocols for very low-sample-size data. Pattern Recognition, 81, pp. 660-673, 2018, ISSN: 0031-3203.
@article{Kuncheva2018b,
title = {On feature selection protocols for very low-sample-size data},
author = {Ludmila I Kuncheva and Juan José Rodríguez},
url = {https://www.sciencedirect.com/science/article/pii/S003132031830102X},
doi = {10.1016/j.patcog.2018.03.012},
issn = {0031-3203},
year = {2018},
date = {2018-09-01},
journal = {Pattern Recognition},
volume = {81},
pages = {660-673},
abstract = {High-dimensional data with very few instances are typical in many application domains. Selecting a highly discriminative subset of the original features is often the main interest of the end user. The widely used feature selection protocol for this type of data consists of two steps. First, features are selected from the data (possibly through cross-validation), and, second, a cross-validation protocol is applied to test a classifier using the selected features. The selected feature set and the testing accuracy are then returned to the user. For lack of a better option, the same low-sample-size dataset is used in both steps. Questioning the validity of this protocol, we carried out an experiment using 24 high-dimensional datasets, three feature selection methods and five classifier models. We found that the accuracy returned by the above protocol is heavily biased, and therefore propose an alternative protocol which avoids the contamination by including both steps in a single cross-validation loop. Statistical tests verify that the classification accuracy returned by the proper protocol is significantly closer to the true accuracy (estimated from an independent testing set) compared to that returned by the currently favoured protocol.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
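The difference between the two protocols is worth a sketch. Below, SelectKBest and a linear SVM stand in for the paper's selectors and classifiers; the point is only that the proper protocol performs selection inside each cross-validation fold (here via a scikit-learn Pipeline), so the test folds never influence which features are chosen.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Wide, small-sample data mimicking the paper's setting.
X, y = make_classification(n_samples=40, n_features=500,
                           n_informative=10, random_state=0)

# Biased protocol: select features on ALL the data, then cross-validate.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(LinearSVC(), X_sel, y, cv=5).mean()

# Proper protocol: selection happens inside each CV fold.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("clf", LinearSVC())])
proper = cross_val_score(pipe, X, y, cv=5).mean()

print(f"biased estimate: {biased:.2f}, proper estimate: {proper:.2f}")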
Arnaiz-González, Álvar; Díez-Pastor, José Francisco; Rodríguez, Juan José; García-Osorio, César: Local sets for multi-label instance selection. Applied Soft Computing, 68, pp. 651-666, 2018, ISSN: 1568-4946.
@article{Arnaiz-González2018b,
title = {Local sets for multi-label instance selection},
author = {Álvar Arnaiz-González and José Francisco Díez-Pastor and Juan José Rodríguez and César García-Osorio},
url = {https://www.sciencedirect.com/science/article/pii/S1568494618302072},
doi = {10.1016/j.asoc.2018.04.016},
issn = {1568-4946},
year = {2018},
date = {2018-07-01},
journal = {Applied Soft Computing},
volume = {68},
pages = {651-666},
abstract = {The multi-label classification problem is an extension of traditional (single-label) classification, in which the output is a vector of values rather than a single categorical value. The multi-label problem is therefore a very different and much more challenging one than the single-label problem. Recently, multi-label classification has attracted interest because of its real-life applications, such as image recognition, bio-informatics, and text categorization, among others. Unfortunately, there are few instance selection techniques capable of processing the data used for these applications. These techniques are also very useful for cleaning and reducing the size of data sets.
In single-label problems, the local set of an instance x comprises all instances in the largest hypersphere centered on x, so that they are all of the same class. This concept has been successfully integrated in the design of Iterative Case Filtering, one of the most influential instance selection methods in single-label learning. Unfortunately, the concept that was originally defined for single-label learning cannot be directly applied to multi-label data, as each instance has more than one label.
An adaptation of the local set concept to multi-label data is proposed in this paper and its effectiveness is verified in the design of two new algorithms that yielded competitive results. One of the adaptations cleans the data sets, to improve their predictive capabilities, while the other aims to reduce data set sizes. Both are tested and compared against the state-of-the-art instance selection methods available for multi-label learning.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
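For reference, the single-label local set described above can be computed directly: the radius of the hypersphere is the distance from x to its nearest enemy (the closest instance of another class). The sketch below covers only this single-label definition, not the paper's multi-label adaptation.

import numpy as np

def local_sets(X, y):
    # For each instance, return the indices of all instances closer
    # to it than its nearest enemy. Brute force, O(n^2) distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    result = []
    for i in range(len(X)):
        enemies = y != y[i]
        nearest_enemy = D[i, enemies].min()
        result.append(np.where(D[i] < nearest_enemy)[0])
    return result

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = rng.integers(0, 2, size=50)
print(local_sets(X, y)[0])   # local set of instance 0 (includes itself)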
2017
Kuncheva, Ludmila I; Rodríguez, Juan José; Jackson, Aaron S: Restricted set classification: Who is there? Pattern Recognition, 63, pp. 158-170, 2017, ISSN: 0031-3203.
@article{Kuncheva2017,
title = {Restricted set classification: Who is there?},
author = {Ludmila I Kuncheva and Juan José Rodríguez and Aaron S Jackson},
url = {https://www.sciencedirect.com/science/article/pii/S0031320316302412},
doi = {10.1016/j.patcog.2016.08.028},
issn = {0031-3203},
year = {2017},
date = {2017-03-01},
journal = {Pattern Recognition},
volume = {63},
pages = {158-170},
abstract = {We consider a problem where a set X of N objects (instances) coming from c classes have to be classified simultaneously. A restriction is imposed on X in that the maximum possible number of objects from each class is known, hence we dubbed the problem who-is-there? We compare three approaches to this problem: (1) independent classification whereby each object is labelled in the class with the largest posterior probability; (2) a greedy approach which enforces the restriction; and (3) a theoretical approach which, in addition, maximises the likelihood of the label assignment, implemented through the Hungarian assignment algorithm. Our experimental study consists of two parts. The first part includes a custom-made chess data set where the pieces on the chess board must be recognised together from an image of the board. In the second part, we simulate the restricted set classification scenario using 96 datasets from a recently collated repository (University of Santiago de Compostela, USC). Our results show that the proposed approach (3) outperforms approaches (1) and (2).},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
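Approach (3) maps naturally onto SciPy's Hungarian solver: replicate each class column once per allowed slot and minimise the summed negative log-posteriors. The posterior matrix and class caps below are invented values for illustration.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Posterior probabilities for 4 objects over 3 classes (made up).
P = np.array([[0.6, 0.3, 0.1],
              [0.5, 0.4, 0.1],
              [0.2, 0.7, 0.1],
              [0.3, 0.3, 0.4]])
max_per_class = [2, 1, 1]   # the "who is there?" restriction

# One column per allowed slot of each class; minimising the sum of
# -log posteriors maximises the likelihood of the joint assignment.
cols = []
for c, m in enumerate(max_per_class):
    cols += [c] * m
cost = -np.log(P[:, cols] + 1e-12)
rows, slots = linear_sum_assignment(cost)
labels = [cols[s] for s in slots]
print(labels)   # one label per object, respecting the class caps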
Sáiz-Manzanares, María Consuelo; Marticorena-Sánchez, Raúl; García-Osorio, César; Díez-Pastor, José Francisco: How Do B-Learning and Learning Patterns Influence Learning Outcomes? Frontiers in Psychology, 8, pp. 745, 2017, ISSN: 1664-1078.
@article{10.3389/fpsyg.2017.00745,
title = {How Do B-Learning and Learning Patterns Influence Learning Outcomes?},
author = {María Consuelo Sáiz-Manzanares and Raúl Marticorena-Sánchez and César García-Osorio and José Francisco Díez-Pastor},
url = {http://journal.frontiersin.org/article/10.3389/fpsyg.2017.00745},
doi = {10.3389/fpsyg.2017.00745},
issn = {1664-1078},
year = {2017},
date = {2017-01-01},
journal = {Frontiers in Psychology},
volume = {8},
pages = {745},
abstract = {Learning Management System (LMS) platforms provide a wealth of information on the learning patterns of students. Learning Analytics (LA) techniques permit the analysis of the logs or records of the activities of both students and teachers on the on-line platform. The learning patterns differ depending on the type of Blended Learning (B-Learning). In this study, we analyse: 1) whether significant differences exist between the learning outcomes of students and their learning patterns on the platform, depending on the type of B-Learning [Replacement blend (RB) vs. Supplemental blend (SB)]; 2) whether a relation exists between the metacognitive and the motivational strategies of students, their learning outcomes and their learning patterns on the platform. The 87,065 log records of 129 students (69 in RB and 60 in SB) in the Moodle 3.1 platform were analysed. The results revealed different learning patterns between students depending on the type of B-Learning (RB vs. SB). We have found that the degree of blend, RB vs. SB, seems to condition student behaviour on the platform. Learning patterns in RB environments can predict student learning outcomes. Additionally, in RB environments there is a relationship between the learning patterns and the metacognitive and motivational strategies of the students.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2016
Arnaiz-González, Álvar; Blachnik, Marcin; Kordos, Mirosław; García-Osorio, César: Fusion of instance selection methods in regression tasks. Information Fusion, 30, pp. 69-79, 2016, ISSN: 1566-2535.
@article{ArnaizGonzalez201669,
title = {Fusion of instance selection methods in regression tasks},
author = {Álvar Arnaiz-González and Marcin Blachnik and Mirosław Kordos and César García-Osorio},
url = {http://www.sciencedirect.com/science/article/pii/S1566253515001141},
doi = {10.1016/j.inffus.2015.12.002},
issn = {1566-2535},
year = {2016},
date = {2016-01-01},
journal = {Information Fusion},
volume = {30},
pages = {69-79},
abstract = {Data pre-processing is a very important aspect of data mining. In this paper we discuss instance selection used for prediction algorithms, which is one of the pre-processing approaches. The purpose of instance selection is to improve data quality through data size reduction and noise elimination. Until recently, instance selection had been applied mainly to classification problems, and very few recent papers address instance selection for regression tasks. This paper proposes the fusion of instance selection algorithms for regression tasks to improve selection performance. Two different families of instance selection methods are evaluated as ensemble members: one based on a distance threshold and the other on converting the regression task into a multiple-class classification task. An extensive experimental evaluation performed on the two regression versions of the Edited Nearest Neighbor (ENN) and Condensed Nearest Neighbor (CNN) methods showed that the best performance, measured by error value and data size reduction, is in most cases obtained by the ensemble methods.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
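A minimal sketch of one member family (a threshold-based regression version of ENN) and of fusion by voting is given below; the thresholds, neighbourhood size, and vote count are illustrative assumptions rather than the paper's settings.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def enn_regression_mask(X, y, k=3, theta=0.5):
    # Regression variant of ENN with a distance threshold: keep an
    # instance when the k-NN prediction of its neighbours is within
    # theta of its actual target value.
    keep = np.zeros(len(X), dtype=bool)
    for i in range(len(X)):
        rest = np.delete(np.arange(len(X)), i)
        knn = KNeighborsRegressor(n_neighbors=k).fit(X[rest], y[rest])
        keep[i] = abs(knn.predict(X[i:i + 1])[0] - y[i]) < theta
    return keep

def fused_mask(X, y, thetas=(0.3, 0.5, 0.7), min_votes=2):
    # Fusion by voting: an instance survives when at least
    # min_votes member selectors decide to keep it.
    votes = sum(enn_regression_mask(X, y, theta=t).astype(int)
                for t in thetas)
    return votes >= min_votes

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=120)
mask = fused_mask(X, y)
print(f"kept {mask.sum()} of {len(X)} instances")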
Arnaiz-González, Álvar; Díez-Pastor, José Francisco; Rodríguez, Juan José; García-Osorio, César: Instance selection of linear complexity for big data. Knowledge-Based Systems, 107, pp. 83-95, 2016, ISSN: 0950-7051.
@article{ArnaizGonzálezLSHIS2016,
title = {Instance selection of linear complexity for big data},
author = {Álvar Arnaiz-González and José Francisco Díez-Pastor and Juan José Rodríguez and César García-Osorio},
url = {http://www.sciencedirect.com/science/article/pii/S0950705116301617},
doi = {10.1016/j.knosys.2016.05.056},
issn = {0950-7051},
year = {2016},
date = {2016-01-01},
journal = {Knowledge-Based Systems},
volume = {107},
pages = {83-95},
abstract = {Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large-to-massive data sets. In this paper, two new algorithms with linear complexity for instance selection purposes are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n^2), or log-linear, O(n log n)) means that they are unable to process large-sized data sets, the new proposal shows competitive results in terms of accuracy. Even more remarkably, it shortens execution time, as the proposal manages to reduce complexity and make it linear with respect to the data set size. The new proposal has been compared with some of the best known instance selection methods for testing and has also been evaluated on large data sets (up to a million instances).},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
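The linear-complexity idea can be sketched with random-hyperplane LSH: hash every instance once and keep one representative per (bucket, class) pair. The hash family and the keep-one policy below are plausible stand-ins, not necessarily the exact scheme of the two proposed algorithms.

import numpy as np

def lsh_select(X, y, n_projections=8, seed=0):
    # Hash with random hyperplanes: each instance gets a bucket id
    # from the sign pattern of n_projections dot products. One pass
    # over the data, so selection is O(n).
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_projections))
    bits = (X @ planes > 0).astype(int)
    keys = bits @ (1 << np.arange(n_projections))   # bucket ids
    seen, keep = set(), []
    for i, (b, c) in enumerate(zip(keys, y)):
        if (int(b), int(c)) not in seen:            # first of its kind
            seen.add((int(b), int(c)))
            keep.append(i)
    return np.array(keep)

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 10))
y = rng.integers(0, 2, size=10000)
idx = lsh_select(X, y)
print(f"kept {len(idx)} of {len(X)} instances")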
2015
Díez-Pastor, José Francisco; Rodríguez, Juan José; García-Osorio, César; Kuncheva, Ludmila I: Random Balance: Ensembles of variable priors classifiers for imbalanced data. Knowledge-Based Systems, 85, pp. 96-111, 2015, ISSN: 0950-7051.
@article{RandomBalance,
title = {Random Balance: Ensembles of variable priors classifiers for imbalanced data},
author = {José Francisco Díez-Pastor and Juan José Rodríguez and César García-Osorio and Ludmila I Kuncheva},
url = {http://www.sciencedirect.com/science/article/pii/S0950705115001720},
doi = {10.1016/j.knosys.2015.04.022},
issn = {0950-7051},
year = {2015},
date = {2015-01-01},
journal = {Knowledge-Based Systems},
volume = {85},
pages = {96-111},
abstract = {In Machine Learning, a data set is imbalanced when the class proportions are highly skewed. Class-imbalanced problems arise routinely in many application domains and pose a challenge to traditional classifiers. We propose a new approach to building ensembles of classifiers for two-class imbalanced data sets, called Random Balance. Each member of the Random Balance ensemble is trained with data sampled from the training set and augmented by artificial instances obtained using SMOTE. The novelty in the approach is that the proportions of the classes for each ensemble member are chosen randomly. The intuition behind the method is that the proposed diversity heuristic will ensure that the ensemble contains classifiers that are specialized for different operating points on the ROC space, thereby leading to larger AUC compared to other ensembles of classifiers. Experiments have been carried out to test the Random Balance approach by itself, and also in combination with standard ensemble methods. As a result, we propose a new ensemble creation method called RB-Boost which combines Random Balance with AdaBoost.M2. This combination involves enforcing random class proportions in addition to instance re-weighting. Experiments with 86 imbalanced data sets from two well-known repositories demonstrate the advantage of the Random Balance approach.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
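The core of Random Balance, a random class proportion per ensemble member, reached by undersampling the class above its quota and oversampling the one below while keeping the training-set size constant, can be sketched as follows. The hand-rolled interpolation stands in for real SMOTE (which interpolates towards k nearest neighbours), and decision trees are an arbitrary choice of base learner.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def smote_like(Xc, n_new, rng):
    # Minimal SMOTE-style interpolation between random same-class pairs.
    i = rng.integers(0, len(Xc), size=n_new)
    j = rng.integers(0, len(Xc), size=n_new)
    lam = rng.random((n_new, 1))
    return Xc[i] + lam * (Xc[j] - Xc[i])

def random_balance_member(X, y, rng):
    # Draw a random share for class 1, undersample whichever class
    # exceeds its quota and SMOTE-augment the other, then train.
    n = len(y)
    n_pos = int(rng.integers(2, n - 2))
    quota = {0: n - n_pos, 1: n_pos}
    parts_X, parts_y = [], []
    for c in (0, 1):
        Xc = X[y == c]
        if len(Xc) >= quota[c]:
            pick = rng.choice(len(Xc), quota[c], replace=False)
            Xc = Xc[pick]
        else:
            Xc = np.vstack([Xc, smote_like(Xc, quota[c] - len(Xc), rng)])
        parts_X.append(Xc)
        parts_y.append(np.full(len(Xc), c))
    return DecisionTreeClassifier().fit(np.vstack(parts_X),
                                        np.concatenate(parts_y))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (180, 5)), rng.normal(1, 1, (20, 5))])
y = np.array([0] * 180 + [1] * 20)
ensemble = [random_balance_member(X, y, rng) for _ in range(10)]
# Average the members' probability estimates for the final score.
scores = np.mean([m.predict_proba(X)[:, 1] for m in ensemble], axis=0)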
Díez-Pastor, José Francisco; Rodríguez, Juan José; García-Osorio, César; Kuncheva, Ludmila I: Diversity techniques improve the performance of the best imbalance learning ensembles. Information Sciences, 325, pp. 98-117, 2015, ISSN: 0020-0255.
@article{DiezPastor201598,
title = {Diversity techniques improve the performance of the best imbalance learning ensembles},
author = {José Francisco Díez-Pastor and Juan José Rodríguez and César García-Osorio and Ludmila I Kuncheva},
url = {http://www.sciencedirect.com/science/article/pii/S0020025515005186},
doi = {10.1016/j.ins.2015.07.025},
issn = {0020-0255},
year = {2015},
date = {2015-01-01},
journal = {Information Sciences},
volume = {325},
pages = {98-117},
abstract = {Many real-life problems can be described as unbalanced, where the number of instances belonging to one of the classes is much larger than the numbers in other classes. Examples are spam detection, credit card fraud detection and medical diagnosis. Ensembles of classifiers have acquired popularity in this kind of problem for their ability to obtain better results than individual classifiers. The techniques most commonly used by those ensembles especially designed to deal with imbalanced problems are, for example, Re-weighting, Oversampling and Undersampling. Other techniques, originally intended to increase ensemble diversity, have not been systematically studied for their effect on imbalanced problems. Among these are Random Oracles, Disturbing Neighbors, Random Feature Weights and Rotation Forest. This paper presents an overview and an experimental study of various ensemble-based methods for imbalanced problems; the methods have been tested in their original form and in conjunction with several diversity-increasing techniques, using 84 imbalanced data sets from two well-known repositories. This paper shows that these diversity-increasing techniques significantly improve the performance of ensemble methods for imbalanced problems and provides some ideas about when it is more convenient to use these diversifying techniques.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
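Of the diversity techniques named, Disturbing Neighbors is compact enough to sketch: each ensemble member receives extra input features encoding which of a few randomly chosen anchor instances is nearest, plus that anchor's class. The sketch below shows only the diversity injection on a balanced toy set; combining it with an imbalance-oriented ensemble, as the paper studies, is omitted, and the anchor count is an assumption.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def disturbing_features(X, anchors, anchor_y):
    # Append a one-hot code of each instance's nearest anchor and
    # that anchor's class label as extra features.
    d = np.linalg.norm(X[:, None, :] - anchors[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    onehot = np.eye(len(anchors))[nearest]
    return np.hstack([X, onehot, anchor_y[nearest, None]])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + rng.normal(scale=0.3, size=200) > 0).astype(int)

ensemble = []
for m in range(10):
    pick = rng.choice(len(X), size=10, replace=False)  # random anchors
    anchors, anchor_y = X[pick], y[pick]
    clf = DecisionTreeClassifier(random_state=m)
    clf.fit(disturbing_features(X, anchors, anchor_y), y)
    # Keep the anchors: the same extra features must be rebuilt
    # for any instance this member later predicts.
    ensemble.append((clf, anchors, anchor_y))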