1. Introduction
Software defect classification is a crucial task that supports the software defect management process. The size and number of software systems are increasing day by day, and the users of these systems keep reporting defects. Classifying defects into categories helps software developers prioritize defects, resolve them faster, and analyze defect-prone modules (Endres 1975; Shepherd 1993; Wagner 2008). When done manually by software developers, this classification is a time-consuming process. In recent years, supervised learning methods have been used to automate defect classification based on orthogonal defect classification.
Orthogonal defect classification (ODC) (Chillarege et al. 1992, 1996) was developed by IBM in the 1990s to support measurement of the software process by extracting valuable information from defects. It acts as a bridge between defect modeling and causal analysis. It groups defects based on their impact, trigger, activity, target, type, source, qualifier and age. The ODC defect impact attribute specifies the effect on the user experience when a defect occurs. Defect impact is further classified into usability, reliability, standard, install-ability, security, maintenance, etc. In recent years, many works have focused on orthogonal defect classification and have tried to categorize defects based on ODC attributes. ODC has been used successfully by many organizations to improve their software development process (Butcher 2002; Soylemez and Tarhan 2013; Bridge and Miller 1998; Mays et al. 1990; Lutz and Mikulski 2004; Zheng et al. 2006).
In this paper, we evaluate five classifiers: Naïve Bayes, Support Vector Machine, K Nearest Neighbor, Random Forest and Decision Tree.
In brief, the main contributions of this paper are:
- Defect categorization from the unstructured text in the description field of defect reports, for 4096 defects from three datasets: MongoDB, Cassandra and HBase
- Selection of the most relevant features using the chi-square score
- Evaluation of the performance of Naïve Bayes, Support Vector Machine, K Nearest Neighbor, Random Forest and Decision Tree on each dataset and on the combined data
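To make the chi-square feature selection contribution concrete, the following is a minimal sketch of how terms in defect descriptions could be scored against one impact class. It assumes the common one-vs-rest 2x2 contingency form of the statistic; this section does not state which variant the paper uses. The function names `chi_square_scores` and `top_k_terms` and the toy defect descriptions are hypothetical and are not taken from the MongoDB, Cassandra or HBase datasets.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Score each term by the 2x2 chi-square statistic against one class.

    docs   : list of token lists (one per defect description)
    labels : list of class labels, parallel to docs
    target : the class whose association with each term is measured
    """
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)
    in_target = Counter()   # documents of the target class containing the term
    anywhere = Counter()    # documents of any class containing the term
    for tokens, y in zip(docs, labels):
        for term in set(tokens):          # document frequency, not raw counts
            anywhere[term] += 1
            if y == target:
                in_target[term] += 1

    scores = {}
    for term, df in anywhere.items():
        a = in_target[term]               # term present, target class
        b = df - a                        # term present, other classes
        c = n_target - a                  # term absent, target class
        d = n - df - c                    # term absent, other classes
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A zero denominator means the term or class carries no information.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_k_terms(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy data for illustration only.
docs = [
    "server crash on restart".split(),
    "crash when node restarts".split(),
    "confusing error message in shell".split(),
    "shell prompt shows confusing text".split(),
]
labels = ["reliability", "reliability", "usability", "usability"]

scores = chi_square_scores(docs, labels, target="reliability")
selected = top_k_terms(scores, k=2)
```

Terms that occur only in one class (such as "crash" in this toy data) receive the highest scores, while terms spread across few documents score lower. In a full pipeline, the selected terms would form the feature vector passed to each of the five classifiers.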
The rest of this paper is organized as follows. Section 2 presents the background, definitions and related work on software defect categorization using orthogonal defect classification. Section 3 discusses the dataset. In Section 4, we define the problem and explain our proposed approach. Section 5 discusses the results in comparison with an existing approach. Section 6 discusses the threats to the validity of our work. The last section concludes the paper and outlines future scope.