Software defect prediction using over-sampling and feature extraction based on Mahalanobis distance
Authors: mohammadmahdi Nshokoohi, SeyedMohammadAli MajidiAnvari, Abbas Rasoolzadegan
Year: 2019
Abstract: As software projects grow larger, Software Defect Prediction (SDP) plays a key role in allocating testing resources reasonably, reducing testing costs, and speeding up the development process. Most SDP methods use machine learning techniques based on software metrics such as the Halstead metrics and McCabe's cyclomatic complexity. However, many of these metrics do not follow a Gaussian distribution, and the defective and non-defective classes overlap. In addition, in many software defect datasets, the number of defective modules (minority class) is much smaller than the number of non-defective modules (majority class). In this situation, the performance of machine learning methods drops dramatically.
Therefore, we first need to balance the minority and majority classes and then transfer the samples into a new space in which pairs of samples from the same class (the must-link set) are as close to each other as possible and pairs of samples from different classes (the cannot-link set) are as far apart as possible.
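The must-link and cannot-link sets described above can be illustrated with a minimal sketch. It assumes that every unordered pair of labeled samples is considered; the abstract does not say how pairs are actually selected, and the function name `build_pair_sets` and the toy labels are hypothetical.

```python
from itertools import combinations


def build_pair_sets(labels):
    """Split all unordered index pairs into must-link and cannot-link sets.

    Assumption: every pair is used. A real method may subsample pairs.
    """
    must_link, cannot_link = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            must_link.append((i, j))    # same class: should end up close
        else:
            cannot_link.append((i, j))  # different classes: should end up far apart
    return must_link, cannot_link


# Toy labels (hypothetical): 1 = defective, 0 = non-defective.
labels = [1, 1, 0, 0, 0]
must_link, cannot_link = build_pair_sets(labels)
print("must-link pairs:", must_link)
print("cannot-link pairs:", cannot_link)
```

With two defective and three non-defective samples, this yields four must-link pairs and six cannot-link pairs, which shows how quickly the cannot-link set can dominate on imbalanced data.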
To achieve these objectives, this paper uses the Mahalanobis distance in two ways. First, the minority class is oversampled based on the Mahalanobis distance so that the generated synthetic samples are more diverse with respect to the other minority samples while the minority class distribution does not change significantly. Second, a feature extraction method based on Mahalanobis distance metric learning is used, which tries to minimize the distances between sample pairs in the must-link set and maximize the distances between sample pairs in the cannot-link set.
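As background for why Mahalanobis metric learning acts as feature extraction, the sketch below checks a standard identity: a Mahalanobis distance with matrix M = LᵀL equals the Euclidean distance after mapping samples through L. This is not the paper's learning algorithm; the matrix `L` and the points `x` and `y` are made-up values, and all function names are hypothetical.

```python
import math


def mat_vec(L, v):
    """Multiply matrix L (list of rows) by vector v."""
    return [sum(l_ij * v_j for l_ij, v_j in zip(row, v)) for row in L]


def gram(L):
    """Return M = L^T L, which is symmetric positive semidefinite."""
    n_cols = len(L[0])
    return [[sum(row[i] * row[j] for row in L) for j in range(n_cols)]
            for i in range(n_cols)]


def mahalanobis(x, y, M):
    """sqrt((x - y)^T M (x - y)) for a positive semidefinite M."""
    diff = [a - b for a, b in zip(x, y)]
    m_diff = mat_vec(M, diff)
    return math.sqrt(sum(d * md for d, md in zip(diff, m_diff)))


def euclidean(x, y):
    """Ordinary Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical linear map; a metric learner would fit this from the pair sets.
L = [[2.0, 0.0],
     [0.5, 1.0]]
M = gram(L)

x, y = [1.0, 3.0], [4.0, -1.0]
d_mahalanobis = mahalanobis(x, y, M)
d_transformed = euclidean(mat_vec(L, x), mat_vec(L, y))
print(d_mahalanobis, d_transformed)
```

Because the two distances agree, learning M is equivalent to learning the linear map L, and the mapped samples can be handed to any classifier that relies on Euclidean geometry.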
To demonstrate the effectiveness of the proposed method, we performed experiments on 12 publicly available datasets collected from NASA repositories and compared the results with those of several strong previous methods. Performance is evaluated in terms of F-measure, G-Mean, and the Matthews Correlation Coefficient (MCC).
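The three evaluation measures can be computed from a binary confusion matrix, as in the sketch below. It assumes the defective class is the positive class and takes G-Mean to be the geometric mean of recall and specificity, which is a common convention in imbalanced learning; the abstract does not state which definition the paper uses. The confusion-matrix counts are invented for illustration.

```python
import math


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (F1)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def g_mean(tp, fp, fn, tn):
    """Geometric mean of recall and specificity (assumed definition)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(recall * specificity)


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient, returning 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Invented counts: 40 defective modules, 160 non-defective modules.
tp, fn, fp, tn = 30, 10, 20, 140
print("F-measure:", round(f_measure(tp, fp, fn), 4))
print("G-Mean:   ", round(g_mean(tp, fp, fn, tn), 4))
print("MCC:      ", round(mcc(tp, fp, fn, tn), 4))
```

All three measures take the minority class into account, which is why they are preferred over plain accuracy when defective modules are rare.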
Electronic identifier: 10.1007/s11227-019-03051-w
Keyword(s): Software Defect Prediction, Software Metrics, Mahalanobis distance, Oversampling, Feature extraction
contributor author | محمدمهدی نژادشکوهی | fa |
contributor author | سیدمحمدعلی مجیدی انوری | fa |
contributor author | عباس رسول زادگان | fa |
contributor author | mohammadmahdi Nshokoohi | en |
contributor author | SeyedMohammadAli MajidiAnvari | en |
contributor author | Abbas Rasoolzadegan | en |
date accessioned | 2020-06-06T13:47:15Z | |
date available | 2020-06-06T13:47:15Z | |
date issued | 2019 | |
identifier uri | http://libsearch.um.ac.ir:80/fum/handle/fum/3369000 | |
description abstract | As software projects grow larger, Software Defect Prediction (SDP) plays a key role in allocating testing resources reasonably, reducing testing costs, and speeding up the development process. Most SDP methods use machine learning techniques based on software metrics such as the Halstead metrics and McCabe's cyclomatic complexity. However, many of these metrics do not follow a Gaussian distribution, and the defective and non-defective classes overlap. In addition, in many software defect datasets, the number of defective modules (minority class) is much smaller than the number of non-defective modules (majority class). In this situation, the performance of machine learning methods drops dramatically. Therefore, we first need to balance the minority and majority classes and then transfer the samples into a new space in which pairs of samples from the same class (the must-link set) are as close to each other as possible and pairs of samples from different classes (the cannot-link set) are as far apart as possible. To achieve these objectives, this paper uses the Mahalanobis distance in two ways. First, the minority class is oversampled based on the Mahalanobis distance so that the generated synthetic samples are more diverse with respect to the other minority samples while the minority class distribution does not change significantly. Second, a feature extraction method based on Mahalanobis distance metric learning is used, which tries to minimize the distances between sample pairs in the must-link set and maximize the distances between sample pairs in the cannot-link set. To demonstrate the effectiveness of the proposed method, we performed experiments on 12 publicly available datasets collected from NASA repositories and compared the results with those of several strong previous methods. Performance is evaluated in terms of F-measure, G-Mean, and the Matthews Correlation Coefficient (MCC). | en |
language | English | |
title | Software defect prediction using over-sampling and feature extraction based on Mahalanobis distance | en |
type | Journal Paper | |
contenttype | External Fulltext | |
subject keywords | Software Defect Prediction | en |
subject keywords | Software Metrics | en |
subject keywords | Mahalanobis distance | en |
subject keywords | Oversampling | en |
subject keywords | Feature extraction | en |
identifier doi | 10.1007/s11227-019-03051-w | |
journal title | Journal of Supercomputing | en |
identifier link | https://profdoc.um.ac.ir/paper-abstract-1076479.html | |
identifier articleid | 1076479 |