
## 2018

Arnaiz-González, Álvar; Díez-Pastor, José Francisco; Rodríguez, Juan José; García-Osorio, César. Local sets for multi-label instance selection. Journal article. In: Applied Soft Computing, 68, pp. 651-666, July 2018, ISSN: 1568-4946. DOI: 10.1016/j.asoc.2018.04.016. URL: https://www.sciencedirect.com/science/article/pii/S1568494618302072

Tags: Data reduction, Instance selection, Local set, Multi-label classification, Nearest neighbor

Abstract: The multi-label classification problem is an extension of traditional (single-label) classification, in which the output is a vector of values rather than a single categorical value. The multi-label problem is therefore a very different and much more challenging one than the single-label problem. Recently, multi-label classification has attracted interest, because of its real-life applications, such as image recognition, bio-informatics, and text categorization, among others. Unfortunately, there are few instance selection techniques capable of processing the data used for these applications. These techniques are also very useful for cleaning and reducing the size of data sets. In single-label problems, the local set of an instance x comprises all instances in the largest hypersphere centered on x, so that they are all of the same class. This concept has been successfully integrated in the design of Iterative Case Filtering, one of the most influential instance selection methods in single-label learning. Unfortunately, the concept that was originally defined for single-label learning cannot be directly applied to multi-label data, as each instance has more than one label. An adaptation of the local set concept to multi-label data is proposed in this paper and its effectiveness is verified in the design of two new algorithms that yielded competitive results. One of the adaptations cleans the data sets, to improve their predictive capabilities, while the other aims to reduce data set sizes. Both are tested and compared against the state-of-the-art instance selection methods available for multi-label learning.
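
The single-label local set described in the abstract can be illustrated with a minimal sketch. It covers only the classic single-label definition, not the multi-label adaptation proposed in the paper. The function name `local_set`, the use of Euclidean distance, and the choice to exclude the instance itself from its own local set are all assumptions made for illustration.

```python
import math


def local_set(X, y, i):
    """Indices of the instances strictly closer to X[i] than its nearest enemy.

    The nearest enemy is the closest instance with a different class label,
    so every instance inside that radius shares the class of X[i].
    Assumption: Euclidean distance, and X[i] itself is left out of the result.
    """
    dists = [math.dist(X[i], x) for x in X]
    enemy_dists = [d for d, label in zip(dists, y) if label != y[i]]
    # With no instance of another class, the hypersphere is unbounded.
    radius = min(enemy_dists) if enemy_dists else math.inf
    return [j for j, d in enumerate(dists) if j != i and d < radius]


# Toy data: three instances of class "a" and two of class "b".
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.5), (3.0, 0.0), (4.0, 0.0)]
y = ["a", "a", "a", "b", "b"]

print(local_set(X, y, 0))  # nearest enemy of instance 0 is at distance 3
print(local_set(X, y, 3))  # nearest enemy of instance 3 is at distance 2
```

With multi-label data each instance carries a vector of labels, so the test `label != y[i]` no longer has an obvious meaning; this is the gap the abstract says the paper addresses.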

## 2016

Arnaiz-González, Álvar; Díez-Pastor, José Francisco; Rodríguez, Juan José; García-Osorio, César. Instance selection of linear complexity for big data. Journal article. In: Knowledge-Based Systems, 107, pp. 83–95, 2016, ISSN: 0950-7051. DOI: 10.1016/j.knosys.2016.05.056. URL: http://www.sciencedirect.com/science/article/pii/S0950705116301617

Tags: Big data, Data Mining, Data reduction, Hashing, Instance selection, Nearest neighbors

Abstract: Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large-to-massive data sets. In this paper, two new algorithms with linear complexity for instance selection purposes are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n²), or log-linear, O(n log n)) means that they are unable to process large-sized data sets, the new proposal shows competitive results in terms of accuracy. Even more remarkably, it shortens execution time, as the proposal manages to reduce complexity and make it linear with respect to the data set size. The new proposal has been compared with some of the best known instance selection methods for testing and has also been evaluated on large data sets (up to a million instances).
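
To give a sense of how locality-sensitive hashing can make instance selection run in a single pass over the data, here is a hedged sketch. It is not a reproduction of the two algorithms in the paper. The random-projection hash family, the rule of keeping the first instance of each class per bucket, and the names `make_hash`, `select_instances`, `width`, and `n_funcs` are all assumptions chosen for illustration.

```python
import math
import random


def make_hash(dim, width, n_funcs, seed=0):
    """Build a bucket function from n_funcs random projections (assumed LSH family)."""
    rng = random.Random(seed)
    params = [
        ([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, width))
        for _ in range(n_funcs)
    ]

    def bucket(x):
        # Each projection maps x to an integer cell; nearby points tend to share cells.
        return tuple(
            math.floor((sum(a_k * x_k for a_k, x_k in zip(a, x)) + b) / width)
            for a, b in params
        )

    return bucket


def select_instances(X, y, width=1.0, n_funcs=2, seed=0):
    """Keep the first instance of each class seen in each bucket (one pass, O(n))."""
    bucket = make_hash(len(X[0]), width, n_funcs, seed)
    seen = set()
    kept = []
    for i, (x, label) in enumerate(zip(X, y)):
        key = (bucket(x), label)
        if key not in seen:
            seen.add(key)
            kept.append(i)
    return kept


# Identical points always share a bucket, so only one per class survives.
X = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
y = ["a", "a", "b"]
print(select_instances(X, y))
```

Each instance is hashed once and checked against a set, so the cost grows linearly with the number of instances, in contrast to the pairwise distance computations behind quadratic methods.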