## 2019

Rodríguez, Juan José; Díez-Pastor, José Francisco; Arnaiz-González, Álvar; Kuncheva, Ludmila I. Random Balance ensembles for multiclass imbalance learning. Journal article, forthcoming. In: Knowledge-Based Systems, ISSN: 0950-7051. Date: 2019-12-27.

Links: https://www.sciencedirect.com/science/article/pii/S0950705119306598, DOI: 10.1016/j.knosys.2019.105434

Tags: Classifier ensembles, Imbalanced data, Multiclass classification

Abstract: Random Balance strategy (RandBal) has been recently proposed for constructing classifier ensembles for imbalanced, two-class data sets. In RandBal, each base classifier is trained with a sample of the data with a random class prevalence, independent of the a priori distribution. Hence, for each sample, one of the classes will be undersampled while the other will be oversampled. RandBal can be applied on its own or can be combined with any other ensemble method. One particularly successful variant is RandBalBoost, which integrates Random Balance and boosting. Encouraged by the success of RandBal, this work proposes two approaches which extend RandBal to multiclass imbalance problems. Multiclass imbalance implies that at least two classes have substantially different proportions of instances. In the first approach proposed here, termed Multiple Random Balance (MultiRandBal), we deal with all classes simultaneously. The training data for each base classifier are sampled with random class proportions. The second approach we propose decomposes the multiclass problem into two-class problems using one-vs-one or one-vs-all, and builds an ensemble of RandBal ensembles. We call the two versions of the second approach OVO-RandBal and OVA-RandBal, respectively. These two approaches were chosen because they are the most straightforward extensions of RandBal for multiple classes. Our main objective is to evaluate both approaches for multiclass imbalanced problems. To this end, an experiment was carried out with 52 multiclass data sets. The results suggest that both MultiRandBal and OVO/OVA-RandBal are viable extensions of the original two-class RandBal. Collectively, they consistently outperform acclaimed state-of-the-art methods for multiclass imbalanced problems.
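The two-class sampling step described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_balance_sample`, the uniform draw of the new class size, and the use of plain duplication for oversampling are all assumptions made for illustration. It only shows the core idea that the total sample size is kept while the class proportions are drawn at random, so one class ends up undersampled and the other oversampled.

```python
import random
from collections import defaultdict


def random_balance_sample(labels, rng):
    """Return indices of a two-class sample with random class proportions.

    Hypothetical helper: a sketch of the idea in the abstract, not the
    published algorithm.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    if len(by_class) != 2:
        raise ValueError("this sketch handles exactly two classes")

    total = len(labels)
    first, second = sorted(by_class)
    # Assumption: the new size of the first class is drawn uniformly, keeping
    # at least one instance of each class; the total size is preserved.
    first_size = rng.randint(1, total - 1)
    targets = {first: first_size, second: total - first_size}

    sample = []
    for label, target in targets.items():
        pool = by_class[label]
        if target <= len(pool):
            # Undersample: keep a random subset without replacement.
            sample.extend(rng.sample(pool, target))
        else:
            # Oversample: keep every instance and add random duplicates
            # (assumption: duplication stands in for any other
            # oversampling technique).
            sample.extend(pool)
            sample.extend(rng.choices(pool, k=target - len(pool)))
    return sample


# Each call would supply the training sample for one base classifier.
labels = [0] * 90 + [1] * 10
rng = random.Random(0)
for _ in range(3):
    indices = random_balance_sample(labels, rng)
    minority_count = sum(labels[i] for i in indices)
    print(len(indices), minority_count)
```

Under these assumptions, an ensemble would call the function once per base classifier, so each member sees a different class prevalence regardless of the original 90/10 split.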

Kuncheva, Ludmila I; Arnaiz-González, Álvar; Díez-Pastor, José Francisco; Gunn, Iain A D. Instance selection improves geometric mean accuracy: a study on imbalanced data classification. Journal article. In: Progress in Artificial Intelligence, 8 (2), pp. 215-228, 2019, ISSN: 2192-6352. Date: 2019-06-01.

Links: https://link.springer.com/article/10.1007/s13748-019-00172-4?wt_mc=Internal.Event.1.SEM.ArticleAuthorOnlineFirst&utm_source=ArticleAuthorContributingOnlineFirst&utm_medium=email&utm_content=AA_en_06082018&ArticleAuthorContributingOnlineFirst_20190209, DOI: 10.1007/s13748-019-00172-4

Tags: Ensemble methods, geometric mean (GM), Imbalanced data, instance/prototype selection, nearest neighbour, Theoretical perspective

Abstract: A natural way of handling imbalanced data is to attempt to equalise the class frequencies and train the classifier of choice on balanced data. For two-class imbalanced problems, the classification success is typically measured by the geometric mean (GM) of the true positive and true negative rates. Here we prove that GM can be improved upon by instance selection, and give the theoretical conditions for such an improvement. We demonstrate that GM is non-monotonic with respect to the number of retained instances, which discourages systematic instance selection. We also show that balancing the distribution frequencies is inferior to a direct maximisation of GM. To verify our theoretical findings, we carried out an experimental study of 12 instance selection methods for imbalanced data, using 66 standard benchmark data sets. The results reveal possible room for new instance selection methods for imbalanced data.
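For readers unfamiliar with the measure named in the abstract above, the following minimal sketch computes the geometric mean (GM) of the true positive and true negative rates for a two-class problem. The function name `geometric_mean` and the toy labels are assumptions for illustration and do not come from the paper.

```python
import math


def geometric_mean(y_true, y_pred, positive=1):
    """GM = sqrt(true positive rate * true negative rate), two classes."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    return math.sqrt(true_positive_rate * true_negative_rate)


# Toy imbalanced problem: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# A classifier that always predicts the majority class looks accurate,
# but it never detects a positive, so its GM is zero.
always_negative = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy)                                 # 0.95
print(geometric_mean(y_true, always_negative))  # 0.0
```

The toy case shows why GM is preferred over plain accuracy on imbalanced data: it drops to zero as soon as one class is ignored entirely.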

## 2018

Kuncheva, Ludmila I; Arnaiz-González, Álvar; Díez-Pastor, José Francisco; Gunn, Iain A D. Instance Selection Improves Geometric Mean Accuracy: A Study on Imbalanced Data Classification. Journal article. In: arXiv, 2018. Date: 2018-04-19.

Links: https://arxiv.org/abs/1804.07155, arXiv:1804.07155v1

Tags: Ensemble methods, geometric mean (GM), Imbalanced data, instance/prototype selection, nearest neighbour

Abstract: A natural way of handling imbalanced data is to attempt to equalise the class frequencies and train the classifier of choice on balanced data. For two-class imbalanced problems, the classification success is typically measured by the geometric mean (GM) of the true positive and true negative rates. Here we prove that GM can be improved upon by instance selection, and give the theoretical conditions for such an improvement. We demonstrate that GM is non-monotonic with respect to the number of retained instances, which discourages systematic instance selection. We also show that balancing the distribution frequencies is inferior to a direct maximisation of GM. To verify our theoretical findings, we carried out an experimental study of 12 instance selection methods for imbalanced data, using 66 standard benchmark data sets. The results reveal possible room for new instance selection methods for imbalanced data.
