Depending on the amount of data to process, file generation may take longer.

If it takes too long to generate, you can limit the data by, for example, reducing the range of years.

Article

Download BibTeX

Title

SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering

Authors

[ 1 ] Instytut Informatyki, Wydział Informatyki, Politechnika Poznańska | [ P ] employee

Year of publication

2015

Published in

Information Sciences

Journal year: 2015 | Journal volume: vol. 291

Article type

scientific article

Publication language

english

Keywords
EN
  • imbalanced classification
  • borderline examples
  • noisy data
  • noise filters
  • SMOTE
Abstract

EN Classification datasets often have an unequal class distribution among their examples. This problem is known as imbalanced classification. The Synthetic Minority Over-sampling Technique (SMOTE) is one of the most well-know data pre-processing methods to cope with it and to balance the different number of examples of each class. However, as recent works claim, class imbalance is not a problem in itself and performance degradation is also associated with other factors related to the distribution of the data. One of these is the presence of noisy and borderline examples, the latter lying in the areas surrounding class boundaries. Certain intrinsic limitations of SMOTE can aggravate the problem produced by these types of examples and current generalizations of SMOTE are not correctly adapted to their treatment. This paper proposes the extension of SMOTE through a new element, an iterative ensemble-based noise filter called Iterative-Partitioning Filter (IPF), which can overcome the problems produced by noisy and borderline examples in imbalanced datasets. This extension results in SMOTE–IPF. The properties of this proposal are discussed in a comprehensive experimental study. It is compared against a basic SMOTE and its most well-known generalizations. The experiments are carried out both on a set of synthetic datasets with different levels of noise and shapes of borderline examples as well as real-world datasets. Furthermore, the impact of introducing additional different types and levels of noise into these real-world data is studied. The results show that the new proposal performs better than existing SMOTE generalizations for all these different scenarios. The analysis of these results also helps to identify the characteristics of IPF which differentiate it from other filtering approaches.

Pages (from - to)

184 - 203

DOI

10.1016/j.ins.2014.08.051

URL

https://www.sciencedirect.com/science/article/pii/S0020025514008561

Ministry points / journal

45

Impact Factor

3,364

This website uses cookies to remember the authenticated session of the user. For more information, read about Cookies and Privacy Policy.