Software defect prediction using over-sampling and feature extraction based on Mahalanobis distance

Author:

Year: 2019

Abstract: As software projects grow larger, Software Defect Prediction (SDP) plays a key role in allocating testing resources reasonably, reducing testing costs, and speeding up the development process. Most SDP methods use machine learning techniques based on software metrics such as the Halstead metrics and McCabe's cyclomatic complexity. However, many of these metrics do not follow a Gaussian distribution, and the defect and non-defect classes overlap. In addition, in many software defect datasets, the number of defective modules (minority class) is much smaller than the number of non-defective modules (majority class). In this situation, the performance of machine learning methods degrades dramatically.
Therefore, we first need to balance the minority and majority classes and then transfer the samples into a new space in which pairs of samples from the same class (must-link set) are as close to each other as possible and pairs of samples from different classes (cannot-link set) are as far apart as possible.
To achieve these objectives, this paper uses the Mahalanobis distance in two ways. First, the minority class is oversampled based on the Mahalanobis distance so that the generated synthetic samples are diverse with respect to the existing minority samples while the minority class distribution is not changed significantly. Second, a feature extraction method based on Mahalanobis distance metric learning is used, which tries to minimize the distances between sample pairs in the must-link set and maximize the distances between sample pairs in the cannot-link set.
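For readers unfamiliar with the distance this abstract builds on, the following is a minimal sketch of the standard Mahalanobis distance, d(x) = sqrt((x - mu)^T S^-1 (x - mu)), computed against a class mean and sample covariance. It is not the paper's oversampling or metric-learning procedure, which the abstract does not specify in detail. The function names (`solve`, `mean_vector`, `covariance`, `mahalanobis`) and the toy sample values are assumptions made for illustration.

```python
from math import sqrt


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mean_vector(samples):
    """Per-feature mean of a list of equal-length samples."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def covariance(samples):
    """Sample covariance matrix (divides by n - 1)."""
    n = len(samples)
    mu = mean_vector(samples)
    d = len(mu)
    return [
        [sum((s[i] - mu[i]) * (s[j] - mu[j]) for s in samples) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]


def mahalanobis(x, mu, cov):
    """Mahalanobis distance of x from mu under covariance cov."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    z = solve(cov, diff)  # z = cov^-1 @ diff, without forming the inverse
    return sqrt(sum(di * zi for di, zi in zip(diff, z)))


# Toy minority-class samples (made-up values, two metrics per module).
minority = [[2.0, 1.0], [3.0, 2.0], [4.0, 2.5], [5.0, 4.0]]
mu = mean_vector(minority)
cov = covariance(minority)
distances = [mahalanobis(s, mu, cov) for s in minority]
```

With an identity covariance the function reduces to the Euclidean distance, which is a quick sanity check on any implementation.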
To demonstrate the effectiveness of the proposed method, we performed experiments on 12 publicly available datasets collected from NASA repositories and compared the results with those of several strong previous methods. Performance is evaluated in terms of F-measure, G-Mean, and Matthews Correlation Coefficient (MCC).
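The three evaluation measures named above can all be derived from a binary confusion matrix. The sketch below assumes the commonly used definitions: F-measure as the harmonic mean of precision and recall, G-Mean as the geometric mean of recall and specificity, and the standard MCC formula. The paper may define G-Mean differently, and the function name `defect_metrics` and the example counts are hypothetical.

```python
from math import sqrt


def defect_metrics(tp, fp, tn, fn):
    """Compute F-measure, G-Mean, and MCC from confusion-matrix counts.

    The defective (minority) class is treated as the positive class.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    pr_sum = precision + recall
    f_measure = 2 * precision * recall / pr_sum if pr_sum else 0.0

    # Assumed definition: geometric mean of recall and specificity.
    g_mean = sqrt(recall * specificity)

    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"f_measure": f_measure, "g_mean": g_mean, "mcc": mcc}


# Hypothetical counts for an imbalanced dataset: 13 defective modules
# out of 100, of which 8 are detected, with 2 false alarms.
scores = defect_metrics(tp=8, fp=2, tn=85, fn=5)
```

All three measures account for the minority class, which is why they are preferred over plain accuracy on imbalanced defect datasets: a classifier that labels every module as non-defective scores 0.0 on each of them.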

DOI: 10.1007/s11227-019-03051-w

URI: http://libsearch.um.ac.ir:80/fum/handle/fum/3369000

Keyword(s): Software Defect Prediction, Software Metrics, Mahalanobis distance, Oversampling, Feature extraction

Collections: ProfDoc


contributor author	محمدمهدی نژادشکوهی	fa
contributor author	سیدمحمدعلی مجیدی انوری	fa
contributor author	عباس رسول زادگان	fa
contributor author	Mohammadmahdi Nshokoohi	en
contributor author	SeyedMohammadAli MajidiAnvari	en
contributor author	Abbas Rasoolzadegan	en
date accessioned	2020-06-06T13:47:15Z
date available	2020-06-06T13:47:15Z
date issued	2019
identifier uri	http://libsearch.um.ac.ir:80/fum/handle/fum/3369000
description abstract	As software projects grow larger, Software Defect Prediction (SDP) plays a key role in allocating testing resources reasonably, reducing testing costs, and speeding up the development process. Most SDP methods use machine learning techniques based on software metrics such as the Halstead metrics and McCabe's cyclomatic complexity. However, many of these metrics do not follow a Gaussian distribution, and the defect and non-defect classes overlap. In addition, in many software defect datasets, the number of defective modules (minority class) is much smaller than the number of non-defective modules (majority class). In this situation, the performance of machine learning methods degrades dramatically. Therefore, we first need to balance the minority and majority classes and then transfer the samples into a new space in which pairs of samples from the same class (must-link set) are as close to each other as possible and pairs of samples from different classes (cannot-link set) are as far apart as possible. To achieve these objectives, this paper uses the Mahalanobis distance in two ways. First, the minority class is oversampled based on the Mahalanobis distance so that the generated synthetic samples are diverse with respect to the existing minority samples while the minority class distribution is not changed significantly. Second, a feature extraction method based on Mahalanobis distance metric learning is used, which tries to minimize the distances between sample pairs in the must-link set and maximize the distances between sample pairs in the cannot-link set. To demonstrate the effectiveness of the proposed method, we performed experiments on 12 publicly available datasets collected from NASA repositories and compared the results with those of several strong previous methods. Performance is evaluated in terms of F-measure, G-Mean, and Matthews Correlation Coefficient (MCC).	en
language	English
title	Software defect prediction using over-sampling and feature extraction based on Mahalanobis distance	en
type	Journal Paper
contenttype	External Fulltext
subject keywords	Software Defect Prediction	en
subject keywords	Software Metrics	en
subject keywords	Mahalanobis distance	en
subject keywords	Oversampling	en
subject keywords	Feature extraction	en
identifier doi	10.1007/s11227-019-03051-w
journal title	Journal of Supercomputing	en
identifier link	https://profdoc.um.ac.ir/paper-abstract-1076479.html
identifier articleid	1076479