Please use this identifier to cite or link to this item:
|Title:||A comparative study of feature selection methods for binary text streams classification|
|Author:||Moraes, Matheus Bernardelli de|
Gradvohl, Andre Leon Sampaio
|Abstract:||Text streams are a continuous flow of high-dimensional text, transmitted at high-volume and high-velocities. They are expected to be classified in real-time, which is challenging due to the high dimensionality of feature space. Applying feature selection algorithms is one solution to reduce text streams feature space and improve the learning process. However, since text streams are potentially unbounded, it is expected a change in their probabilistic distribution over time, the so-called Concept Drift. The concept drift impacts the feature selection process due to the feature drift when the relevance of features is also subject to changes over time. This paper presents a comparative study of six feature selection methods for binary text streams classification, even in the presence of feature drift. We also propose the Online Feature Selection with Evolving Regularization (OFSER) algorithm, a modified version of the Online Feature Selection (OFS) algorithm, which uses evolving regularization to dynamically penalize model complexity, reducing feature drift impacts on the feature selection process. We conducted the experimental analysis on eleven real-world, commonly used datasets for text classification. The OFSER algorithm showed F1-scores up to 12.92% higher than other algorithms in some cases. The results using Iman and Davenport and Bergmann–Hommel’s tests show that OFSER algorithm is statistically superior to Information Gain and Extremal Feature Selection algorithms in terms of improving the base classifier predictive power|
|Appears in Collections:||FT - Artigos e Outros Documentos|
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.