2021
Juez-Gil, Mario; Arnaiz-González, Álvar; Rodríguez, Juan José; López-Nozal, Carlos; García-Osorio, César
Approx-SMOTE: Fast SMOTE for Big Data on Apache Spark (Journal Article)
In: Neurocomputing, vol. 464, pp. 432-437, 2021, ISSN: 0925-2312.
@article{Juez-Gil2021bb,
title = {Approx-SMOTE: Fast SMOTE for Big Data on Apache Spark},
author = {Mario Juez-Gil and Álvar Arnaiz-González and Juan José Rodríguez and Carlos López-Nozal and César García-Osorio},
url = {https://www.sciencedirect.com/science/article/pii/S0925231221012832},
doi = {10.1016/j.neucom.2021.08.086},
issn = {0925-2312},
year = {2021},
date = {2021-11-13},
journal = {Neurocomputing},
volume = {464},
pages = {432--437},
abstract = {One of the main goals of Big Data research is to find new data mining methods able to process large amounts of data in acceptable times. In Big Data classification, as in traditional classification, class imbalance is a common problem that must be addressed; in the case of Big Data, the solution must also run in an acceptable execution time. In this paper we present Approx-SMOTE, a parallel implementation of the SMOTE algorithm for the Apache Spark framework. The key difference with the original SMOTE, besides parallelism, is that it uses an approximate version of k-Nearest Neighbors, which makes it highly scalable. Although an implementation of SMOTE for Big Data already exists (SMOTE-BD), it uses an exact Nearest Neighbor search, which prevents it from being fully scalable. Approx-SMOTE, on the other hand, achieves run times up to 30 times faster without sacrificing the improved classification performance offered by the original SMOTE.},
keywords = {Big data, Data Mining, imbalance, SMOTE, Spark},
pubstate = {published},
tppubtype = {article}
}
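The core idea the abstract describes, synthesizing minority-class examples by interpolating between a point and one of its k nearest neighbors, can be sketched as follows. This is an illustrative toy version, not the paper's Spark implementation: `smote_sample` is a hypothetical helper name, and it uses a brute-force exact neighbor search, which is precisely the step Approx-SMOTE replaces with an approximate k-NN to scale on Spark.

```python
import math
import random

def smote_sample(minority, k=3, n_new=4, seed=0):
    """SMOTE-style oversampling sketch (pure Python, brute-force k-NN).

    Illustrative only: Approx-SMOTE (the paper) swaps the exact neighbor
    search below for an approximate one to gain scalability on Spark.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # exact k nearest neighbors of x within the minority class
        nbrs = sorted((p for p in minority if p is not x),
                      key=lambda p: math.dist(p, x))[:k]
        nb = rng.choice(nbrs)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic
```

Each synthetic point lies on the segment between a minority example and one of its neighbors, so the new samples stay inside the minority region rather than being arbitrary noise.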
Juez-Gil, Mario; Arnaiz-González, Álvar; Rodríguez, Juan José; García-Osorio, César
Experimental evaluation of ensemble classifiers for imbalance in Big Data (Journal Article)
In: Applied Soft Computing, vol. 108, no. 107447, 2021, ISSN: 1568-4946.
@article{Juez-Gil2021b,
title = {Experimental evaluation of ensemble classifiers for imbalance in Big Data},
author = {Mario Juez-Gil and Álvar Arnaiz-González and Juan José Rodríguez and César García-Osorio},
url = {https://www.sciencedirect.com/science/article/pii/S1568494621003707?via%3Dihub},
doi = {10.1016/j.asoc.2021.107447},
issn = {1568-4946},
year = {2021},
date = {2021-09-01},
journal = {Applied Soft Computing},
volume = {108},
number = {107447},
abstract = {Datasets are growing in size and complexity at a pace never seen before, forming ever larger datasets known as Big Data. A common problem for classification, especially in Big Data, is that the numbers of examples in the different classes might not be balanced. Imbalanced classification was therefore introduced some decades ago, to correct the tendency of classifiers to favor the majority class and ignore the minority one. To date, although the number of imbalanced classification methods has increased, they continue to focus on normal-sized datasets and not on the new reality of Big Data. In this paper, in-depth experimentation with ensemble classifiers is conducted in the context of imbalanced Big Data classification, using two popular ensemble families (Bagging and Boosting) and different resampling methods. All the experimentation was launched in Spark clusters, comparing ensemble performance and execution times with statistical test results, including the newest ones based on the Bayesian approach. One very interesting conclusion from the study was that simpler methods applied to unbalanced datasets in the context of Big Data provided better results than complex methods. The additional complexity of some of the sophisticated methods, which appears necessary to reduce imbalance in normal-sized datasets, was not effective for imbalanced Big Data.},
keywords = {Big data, ensemble, imbalance, resampling, Spark, unbalance},
pubstate = {published},
tppubtype = {article}
}
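One family of the "simpler methods" such studies compare is bagging over balanced resamples: each ensemble member trains on a bag that keeps all minority examples plus an equal-sized random draw from the majority class. A minimal sketch of drawing such bags (hypothetical helper, not the paper's Spark code; it returns index lists a base learner would then train on):

```python
import random
from collections import Counter

def under_bagging_bags(y, n_estimators=5, seed=0):
    """Build balanced training bags via random undersampling.

    Sketch only: each bag contains every minority-class index and an
    equally sized random sample of majority-class indices, one bag per
    ensemble member.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    minority_label = min(counts, key=counts.get)
    minority = [i for i, lbl in enumerate(y) if lbl == minority_label]
    majority = [i for i, lbl in enumerate(y) if lbl != minority_label]
    bags = []
    for _ in range(n_estimators):
        picked = rng.sample(majority, k=len(minority))
        bags.append(sorted(minority + picked))
    return bags
```

Because each bag is perfectly balanced, the base classifiers see no class skew, while the ensemble as a whole still covers most of the majority class across its members.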