2021
Juez-Gil, Mario; Arnaiz-González, Álvar; Rodríguez, Juan José; García-Osorio, César
Experimental evaluation of ensemble classifiers for imbalance in Big Data Journal Article
In: Applied Soft Computing, vol. 108, art. no. 107447, 2021, ISSN: 1568-4946.
@article{Juez-Gil2021b,
title = {Experimental evaluation of ensemble classifiers for imbalance in Big Data},
author = {Mario Juez-Gil and Álvar Arnaiz-González and Juan José Rodríguez and César García-Osorio},
url = {https://www.sciencedirect.com/science/article/pii/S1568494621003707?via%3Dihub},
doi = {10.1016/j.asoc.2021.107447},
issn = {1568-4946},
year = {2021},
date = {2021-09-01},
journal = {Applied Soft Computing},
volume = {108},
number = {107447},
abstract = {Datasets are growing in size and complexity at a pace never seen before, forming ever larger collections known as Big Data. A common problem for classification, especially in Big Data, is that the numbers of examples of the different classes might not be balanced. Imbalanced classification was therefore introduced some decades ago, to correct the tendency of classifiers to show bias in favor of the majority class and to ignore the minority one. To date, although the number of imbalanced classification methods has increased, they continue to focus on normal-sized datasets and not on the new reality of Big Data. In this paper, in-depth experimentation with ensemble classifiers is conducted in the context of imbalanced Big Data classification, using two popular ensemble families (Bagging and Boosting) and different resampling methods. All the experiments were launched on Spark clusters, comparing ensemble performance and execution times with statistical tests, including the newest ones based on the Bayesian approach. One very interesting conclusion from the study was that simpler methods applied to unbalanced datasets in the context of Big Data provided better results than complex methods. The additional complexity of some of the sophisticated methods, which appear necessary to process and to reduce imbalance in normal-sized datasets, was not effective for imbalanced Big Data.},
keywords = {Big data, ensemble, imbalance, resampling, Spark, unbalance},
pubstate = {published},
tppubtype = {article}
}
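The abstract above pairs ensembles (Bagging, Boosting) with resampling methods. One simple combination of that kind — Bagging where each member trains on a balanced undersample of the majority class — can be sketched in a few lines of pure Python. This is an illustrative sketch, not the paper's implementation: the stump classifier, function names, and toy data are all choices made here for demonstration.

```python
import random

def balanced_sample(data, rng):
    """Undersample the majority class so both classes are equally represented."""
    pos = [d for d in data if d[1] == 1]
    neg = [d for d in data if d[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))

def train_stump(sample):
    """Fit a one-feature threshold classifier (a deliberately simple base learner)."""
    mean1 = sum(x for x, y in sample if y == 1) / sum(1 for _, y in sample if y == 1)
    mean0 = sum(x for x, y in sample if y == 0) / sum(1 for _, y in sample if y == 0)
    thr = (mean0 + mean1) / 2          # midpoint between class means
    sign = 1 if mean1 >= mean0 else -1
    return lambda x: 1 if sign * (x - thr) >= 0 else 0

def under_bagging(data, n_members=11, seed=0):
    """Bagging variant: each member sees its own balanced undersample."""
    rng = random.Random(seed)
    members = [train_stump(balanced_sample(data, rng)) for _ in range(n_members)]
    def predict(x):
        votes = sum(m(x) for m in members)   # majority vote over members
        return 1 if votes > len(members) / 2 else 0
    return predict

# Imbalanced toy data: 50 majority examples (class 0) vs. 5 minority (class 1).
rng = random.Random(42)
data = [(rng.gauss(0, 1), 0) for _ in range(50)] + [(rng.gauss(4, 1), 1) for _ in range(5)]
model = under_bagging(data)
print(model(4.2), model(-0.3))
```

The point of the sketch is the structure, not the base learner: because every member sees a balanced view of the data, the vote is not dominated by the majority class, which matches the paper's observation that simple resampling-plus-ensemble schemes can be effective.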
Rodríguez, Juan José; Juez-Gil, Mario; López-Nozal, Carlos; Arnaiz-González, Álvar
Rotation Forest for multi-target regression Journal Article
In: International Journal of Machine Learning and Cybernetics, 2021, ISSN: 1868-808X.
@article{Rodríguez2021,
title = {Rotation Forest for multi-target regression},
author = {Juan José Rodríguez and Mario Juez-Gil and Carlos López-Nozal and Álvar Arnaiz-González},
url = {https://link.springer.com/article/10.1007/s13042-021-01329-1},
doi = {10.1007/s13042-021-01329-1},
issn = {1868-808X},
year = {2021},
date = {2021-04-22},
journal = {International Journal of Machine Learning and Cybernetics},
abstract = {The prediction of multiple numeric outputs at the same time is called multi-target regression (MTR), and it has gained attention in recent decades. This task is a challenging research topic in supervised learning because it poses additional difficulties to traditional single-target regression (STR), and many real-world problems involve the prediction of multiple targets at once. One of the most successful approaches to deal with MTR, although not the only one, consists in transforming the problem into several STR problems, whose outputs are combined to build up the MTR output. In this paper, the Rotation Forest ensemble method, previously proposed for single-label classification and single-target regression, is adapted to MTR tasks and tested with several regressors and data sets. Our proposal rotates the input space in an efficient and novel fashion, avoiding the extra rotations forced by MTR problem decomposition. Four approaches for MTR are used: single-target (ST), stacked single-target (SST), Ensembles of Regressor Chains (ERC), and Multi-target Regression via Quantization (MRQ). For assessing the benefits of the proposal, thorough experimentation with 28 MTR data sets and statistical tests is used, concluding that Rotation Forest, adapted by means of these approaches, outperforms other popular ensembles, such as Bagging and Random Forest.},
keywords = {ensemble, multi-target regression, Rotation forest, SELECTED},
pubstate = {published},
tppubtype = {article}
}
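The core step the abstract refers to — rotating the input space by fitting PCA on random feature subsets and assembling a block-diagonal transform — can be sketched in pure Python. This is a minimal illustration, not the authors' code: the pairs-of-two subset size, the 75% sample fraction, and all names are choices made here, and the closed-form 2x2 PCA assumes an even number of features.

```python
import math
import random

def pca_rotation_2d(points):
    """Closed-form PCA for a pair of features: the eigenvectors of the
    2x2 covariance matrix, returned as columns of a rotation matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n           # var(x)
    c = sum((p[1] - my) ** 2 for p in points) / n           # var(y)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n  # cov(x, y)
    theta = 0.5 * math.atan2(2 * b, a - c)                  # principal-axis angle
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]

def build_rotation(X, rng):
    """One Rotation-Forest-style rotation: split the features into random
    pairs, fit PCA on a 75% sample for each pair, and assemble a
    block-diagonal orthogonal transform (assumes an even feature count)."""
    d = len(X[0])
    feats = list(range(d))
    rng.shuffle(feats)
    pairs = [feats[i:i + 2] for i in range(0, d, 2)]
    blocks = []
    for pair in pairs:
        sub = rng.sample(X, max(2, int(0.75 * len(X))))
        blocks.append((pair, pca_rotation_2d([[r[pair[0]], r[pair[1]]] for r in sub])))
    def rotate(row):
        out = [0.0] * d
        for (i, j), R in blocks:
            out[i] = R[0][0] * row[i] + R[1][0] * row[j]  # 1st principal component
            out[j] = R[0][1] * row[i] + R[1][1] * row[j]  # 2nd principal component
        return out
    return rotate

rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(30)]
rotate = build_rotation(X, rng)
X_rot = [rotate(row) for row in X]  # a base regressor would be trained on X_rot
```

The efficiency point made in the abstract corresponds to building `rotate` once and reusing the rotated data for every target, rather than rotating again for each STR subproblem produced by the ST, SST, ERC, or MRQ decompositions.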