2022
Ramos-Pérez, Ismael; Arnaiz-González, Álvar; Rodríguez, Juan José; García-Osorio, César
When is resampling beneficial for feature selection with imbalanced wide data? Journal Article
In: Expert Systems with Applications, vol. 188, pp. 116015, 2022, ISSN: 0957-4174.
Tags: Feature selection, High dimensional data, Machine learning, SELECTED, Unbalanced, Very low sample size, Wide data
@article{Ramos-Pérez2022,
title = {When is resampling beneficial for feature selection with imbalanced wide data?},
author = {Ismael Ramos-Pérez and Álvar Arnaiz-González and Juan José Rodríguez and César García-Osorio},
url = {https://www.sciencedirect.com/science/article/pii/S0957417421013622},
doi = {10.1016/j.eswa.2021.116015},
issn = {0957-4174},
year = {2022},
date = {2022-02-01},
journal = {Expert Systems with Applications},
volume = {188},
pages = {116015},
abstract = {This paper studies the effects that combinations of balancing and feature selection techniques have on wide data (many more attributes than instances) when different classifiers are used. For this, an extensive study is done using 14 datasets, 3 balancing strategies, and 7 feature selection algorithms. The evaluation is carried out using 5 classification algorithms, analyzing the results for different percentages of selected features, and establishing the statistical significance using Bayesian tests.
Some general conclusions of the study are that it is better to use RUS before the feature selection, while ROS and SMOTE offer better results when applied afterwards. Additionally, specific results are also obtained depending on the classifier used, for example, for Gaussian SVM the best performance is obtained when the feature selection is done with SVM-RFE before balancing the data with RUS.},
keywords = {Feature selection, High dimensional data, Machine learning, SELECTED, Unbalanced, Very low sample size, Wide data},
pubstate = {published},
tppubtype = {article}
}
This paper studies the effects that combinations of balancing and feature selection techniques have on wide data (many more attributes than instances) when different classifiers are used. For this, an extensive study is done using 14 datasets, 3 balancing strategies, and 7 feature selection algorithms. The evaluation is carried out using 5 classification algorithms, analyzing the results for different percentages of selected features, and establishing the statistical significance using Bayesian tests.
Some general conclusions of the study are that it is better to use RUS before the feature selection, while ROS and SMOTE offer better results when applied afterwards. Additionally, specific results are also obtained depending on the classifier used, for example, for Gaussian SVM the best performance is obtained when the feature selection is done with SVM-RFE before balancing the data with RUS.
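A rough sketch of the two orderings compared above, assuming the scikit-learn and imbalanced-learn APIs; the toy dataset, feature counts, and estimator settings are illustrative assumptions, not the paper's experimental setup.

# Illustrative sketch (not the paper's exact setup): RUS applied before
# SVM-RFE feature selection, versus SMOTE applied after it.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# Wide, imbalanced toy data: many more features than instances.
X, y = make_classification(n_samples=60, n_features=500, n_informative=20,
                           weights=[0.8, 0.2], random_state=1)

# Ordering 1 (reported better for RUS): undersample first, then select.
X_bal, y_bal = RandomUnderSampler(random_state=1).fit_resample(X, y)
rfe = RFE(SVC(kernel="linear"), n_features_to_select=25, step=0.2)
X_sel = rfe.fit_transform(X_bal, y_bal)
clf = SVC(kernel="rbf").fit(X_sel, y_bal)  # Gaussian SVM on reduced data

# Ordering 2 (reported better for ROS/SMOTE): select first, then oversample.
rfe2 = RFE(SVC(kernel="linear"), n_features_to_select=25, step=0.2)
X_sel2 = rfe2.fit_transform(X, y)
X_bal2, y_bal2 = SMOTE(random_state=1).fit_resample(X_sel2, y)
clf2 = SVC(kernel="rbf").fit(X_bal2, y_bal2)

The paper's full study crosses 3 balancing strategies and 7 selection algorithms in both orderings, over 14 datasets and 5 classifiers, and compares the results with Bayesian tests.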
2018
Kuncheva, Ludmila I; Rodríguez, Juan José
On feature selection protocols for very low-sample-size data Journal Article
In: Pattern Recognition, vol. 81, pp. 660-673, 2018, ISSN: 0031-3203.
Tags: Cross-validation, Experimental protocol, Feature selection, SELECTED, Training/testing, Wide datasets
@article{Kuncheva2018b,
title = {On feature selection protocols for very low-sample-size data},
author = {Ludmila I Kuncheva and Juan José Rodríguez},
url = {https://www.sciencedirect.com/science/article/pii/S003132031830102X},
doi = {10.1016/j.patcog.2018.03.012},
issn = {0031-3203},
year = {2018},
date = {2018-09-01},
journal = {Pattern Recognition},
volume = {81},
pages = {660-673},
abstract = {High-dimensional data with very few instances are typical in many application domains. Selecting a highly discriminative subset of the original features is often the main interest of the end user. The widely-used feature selection protocol for such type of data consists of two steps. First, features are selected from the data (possibly through cross-validation), and, second, a cross-validation protocol is applied to test a classifier using the selected features. The selected feature set and the testing accuracy are then returned to the user. For the lack of a better option, the same low-sample-size dataset is used in both steps. Questioning the validity of this protocol, we carried out an experiment using 24 high-dimensional datasets, three feature selection methods and five classifier models. We found that the accuracy returned by the above protocol is heavily biased, and therefore propose an alternative protocol which avoids the contamination by including both steps in a single cross-validation loop. Statistical tests verify that the classification accuracy returned by the proper protocol is significantly closer to the true accuracy (estimated from an independent testing set) compared to that returned by the currently favoured protocol.},
keywords = {Cross-validation, Experimental protocol, Feature selection, SELECTED, Training/testing, Wide datasets},
pubstate = {published},
tppubtype = {article}
}
High-dimensional data with very few instances are typical in many application domains. Selecting a highly discriminative subset of the original features is often the main interest of the end user. The widely-used feature selection protocol for such type of data consists of two steps. First, features are selected from the data (possibly through cross-validation), and, second, a cross-validation protocol is applied to test a classifier using the selected features. The selected feature set and the testing accuracy are then returned to the user. For the lack of a better option, the same low-sample-size dataset is used in both steps. Questioning the validity of this protocol, we carried out an experiment using 24 high-dimensional datasets, three feature selection methods and five classifier models. We found that the accuracy returned by the above protocol is heavily biased, and therefore propose an alternative protocol which avoids the contamination by including both steps in a single cross-validation loop. Statistical tests verify that the classification accuracy returned by the proper protocol is significantly closer to the true accuracy (estimated from an independent testing set) compared to that returned by the currently favoured protocol.
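A minimal sketch of the two protocols contrasted above, assuming scikit-learn; the toy data and the univariate selector here are illustrative assumptions, not the paper's setup. Wrapping the selector in a pipeline keeps feature selection inside a single cross-validation loop, so a test fold never influences which features are chosen.

# Illustrative sketch (not the paper's exact setup): biased protocol versus
# feature selection performed inside a single cross-validation loop.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Wide toy data: many more features than instances.
X, y = make_classification(n_samples=50, n_features=2000, n_informative=10,
                           random_state=0)

# Biased protocol: features are chosen once using ALL instances, so every
# test fold has already influenced the selection.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y, cv=5)

# Proper protocol: the selector sits inside the pipeline and is refit on
# each training fold only; test folds never touch the selection step.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("clf", LogisticRegression(max_iter=1000))])
proper = cross_val_score(pipe, X, y, cv=5)

print(f"biased: {biased.mean():.3f}   proper: {proper.mean():.3f}")

On data this wide, the biased estimate is typically much higher than the proper one, mirroring the optimistic bias the paper measures against an independent test set.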