W zależności od ilości danych do przetworzenia generowanie pliku może się wydłużyć.

Jeśli generowanie trwa zbyt długo można ograniczyć dane np. zmniejszając zakres lat.

Rozdział

Pobierz BibTeX

Tytuł

Overlapping, rare examples and class decomposition in learning classifiers from imbalanced data

Autorzy

[ 1 ] Instytut Informatyki, Wydział Informatyki, Politechnika Poznańska | [ P ] pracownik

Rok publikacji

2013

Typ rozdziału

rozdział w monografii naukowej

Język publikacji

angielski

Streszczenie

EN This paper deals with inducing classifiers from imbalanced data, where one class (a minority class) is under-represented in comparison to the remaining classes (majority classes). The minority class is usually of primary interest and it is required to recognize its members as accurately as possible. Class imbalance constitutes a difficulty for most algorithms learning classifiers as they are biased toward the majority classes. The first part of this study is devoted to discussing main properties of data that cause this difficulty. Following the review of earlier, related research several types of artificial, imbalanced data sets affected by critical factors have been generated. The decision trees and rule based classifiers have been generated from these data sets. Results of first experiments show that too small number of examples from the minority class is not the main source of difficulties. These results confirm the initial hypothesis saying the degradation of classification performance is more related to the minority class decomposition into small sub-parts. Another critical factor concerns presence of a relatively large number of borderline examples from the minority class in the overlapping region between classes, in particular for non-linear decision boundaries. The novel observation is showing the impact of rare examples from the minority class located inside the majority class. The experiments make visible that stepwise increasing the number of borderline and rare examples in the minority class has larger influence on the considered classifiers than increasing the decomposition of this class. The second part of this paper is devoted to studying an improvement of classifiers by pre-processing of such data with resampling methods. Next experiments examine the influence of the identified critical data factors on performance of 4 different pre-processing re-sampling methods: two versions of random over-sampling, focused under-sampling NCR and the hybrid method SPIDER. Results show that if data is sufficiently disturbed by borderline and rare examples SPIDER and partly NCR work better than over-sampling.

Strony (od-do)

277 - 306

DOI

10.1007/978-3-642-28699-5_11

URL

https://link.springer.com/chapter/10.1007/978-3-642-28699-5_11

Książka

Emerging paradigms in machine learning

Ta strona używa plików Cookies, w celu zapamiętania uwierzytelnionej sesji użytkownika. Aby dowiedzieć się więcej przeczytaj o plikach Cookies i Polityce Prywatności.