Data Balancing Technique for Multi-Class Imbalanced Problems
Pages : 459-463
Abstract
An imbalanced dataset has a skewed class distribution, which creates difficulties for machine learning algorithms. These algorithms often fail to produce accurate results in the presence of data imbalance, overlapping class boundaries, and hybrid datasets. Various techniques have been proposed in the literature to balance a dataset using either oversampling or undersampling, but these techniques have mostly been studied independently, and little work has been done on combining the two. The proposed system focuses on the study and implementation of oversampling and undersampling together to balance a dataset, and the technique is generalized to hybrid datasets. A cluster-based undersampling approach is applied first, followed by Mahalanobis distance-based oversampling. The technique will be tested on multiple hybrid datasets, and classification accuracy will be evaluated using the C4.5 algorithm. The resulting accuracy will be compared with that of the oversampling and undersampling approaches applied individually.
Keywords: oversampling, undersampling, hybrid dataset, Mahalanobis distance, cluster-based undersampling, imbalanced data, classification
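The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes purely numeric features (hybrid attributes are not handled), uses a plain k-means loop for the cluster-based undersampling step, and approximates Mahalanobis distance-based oversampling by generating each synthetic point at the same Mahalanobis distance from the class mean as a randomly chosen seed sample. The function names `cluster_undersample`, `mahalanobis_oversample`, and `balance`, and the `target` class size, are hypothetical and chosen only for illustration.

```python
import numpy as np


def cluster_undersample(X, n_keep, n_iter=20, seed=0):
    """Reduce a large class to at most n_keep rows.

    Runs a simple k-means with n_keep clusters and keeps the real sample
    closest to each centroid (duplicates are dropped, so fewer than
    n_keep rows may be returned).
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=n_keep, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_keep):
            members = X[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    nearest = np.unique(dists.argmin(axis=0))
    return X[nearest]


def mahalanobis_oversample(X, n_new, seed=0, eps=1e-9):
    """Create n_new synthetic rows for a small class.

    Each synthetic row lies at the same Mahalanobis distance from the
    class mean as a randomly chosen seed row, but in a random direction.
    """
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)
    # Small ridge term (an assumption) keeps the covariance invertible.
    cov = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
    eigvals, eigvecs = np.linalg.eigh(cov)
    scale = np.sqrt(np.clip(eigvals, eps, None))
    seeds = X[rng.integers(0, len(X), size=n_new)]
    # Whitened coordinates: Euclidean norm here equals Mahalanobis distance.
    whitened = (seeds - mu) @ eigvecs / scale
    radius = np.linalg.norm(whitened, axis=1, keepdims=True)
    direction = rng.normal(size=whitened.shape)
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    # Map back from whitened space to the original feature space.
    return mu + (radius * direction * scale) @ eigvecs.T


def balance(classes, target, seed=0):
    """Balance a dict {label: (n_i, d) array} toward `target` rows per class."""
    balanced = {}
    for label, X in classes.items():
        if len(X) > target:
            balanced[label] = cluster_undersample(X, target, seed=seed)
        elif len(X) < target:
            extra = mahalanobis_oversample(X, target - len(X), seed=seed)
            balanced[label] = np.vstack([X, extra])
        else:
            balanced[label] = X
    return balanced
```

In this sketch, classes larger than `target` are undersampled and classes smaller than `target` are oversampled, so a multi-class dataset is balanced in a single pass. The balanced output could then be passed to a decision tree classifier for evaluation; that step is omitted here.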