Please login to be able to save your searches and receive alerts for new content matching your search criteria.
Classification, an important data mining function that assigns class label to items in a collection, is of practical applications in various domains. In software engineering, for instance, a common classification problem is to determine the quality of a software item. In such a problem, software metrics represent the independent features while the fault proneness represents the class label. With many classification problems, one must often deal with the presence of irrelevant features in the feature space. That, coupled with class imbalance, renders the task of discriminating one class from another rather difficult. In this study, we empirically evaluate our proposed wrapper-based feature ranking where nine performance metrics aided by a particular learner and a methodology are considered. We examine five learners and take three different approaches, each in conjunction with one of three different methodologies: 3-fold Cross-Validation, 3-fold Cross-Validation Risk Impact, and a combination of the two. In this study, we consider two sets of software engineering datasets. To evaluate the classifier performance after feature selection has been applied, we use Area Under Receiver Operating Characteristic curve as the performance evaluator. We investigate the performance of feature selection as we vary the three factors that form the foundation of the wrapper-based feature ranking. We show that the performance is conditioned by not only the choice of methodology but also the learner. We also evaluate the effect of sampling on wrapper-based feature ranking. Finally, we provide guidance as to which software metrics are relevant in software defect prediction problems and how the number of software metrics can be selected when using wrapper-based feature ranking.
A rule-based classification model is presented to identify high-risk software modules. It utilizes the power of rough set theory to reduce the number of attributes, and the equal frequency binning algorithm to partition the values of the attributes. As a result, a set of conjuncted Boolean predicates are formed. The model is inherently influenced by the practical needs of the system being modeled, thus allowing the analyst to determine which rules are to be used for classifying the fault-prone and not fault-prone modules. The proposed model also enables the analyst to control the number of rules that constitute the model. Empirical validation of the model is accomplished through a case study of a large legacy telecommunications system. The ease of rule interpretation and the transparency of the functional aspects of the model are clearly demonstrated. It is concluded that the new model is effective in achieving the software quality classification.
Inspection is widely believed to be one of the most cost-effective methods for detection of defects in the work products produced during software development. However, the inspection process, by its very nature, is labor intensive and for delivering value, they have to be properly executed and controlled. While controlling the inspection process, the inspection module size is a key control parameter. Larger module size can lead to an increased leakage of defects which increases the cost since rework in the subsequent phases is more expensive. Small module size reduces the defect leakage but increases the number of inspections. In this paper, we formulate a cost model for an inspection process using which the total cost can be minimized. We then use the technique of Design of Experiments to study how the optimum module size varies with some of the key parameters of the inspection process, and determine the optimum module size for different situations.
Embedded-computer systems have become essential to life in modern society. For example, the backbone of society's information infrastructure is telecommunications. Embedded systems must have highly reliable software, so that we avoid the severe consequences of failures, intolerable down-time, and expensive repairs in remote locations. Moreover, today's fast-moving technology marketplace mandates that embedded systems evolve, resulting in multiple software releases embedded in multiple products.
Software quality models can be valuable tools for software engineering of embedded systems, because some software-enhancement techniques are so expensive or time-consuming that it is not practical to apply them to all modules. Targeting such enhancement techniques is an effective way to reduce the likelihood of faults discovered in the field. Research has shown software metrics to be useful predictors of software faults. A software quality model is developed using measurements and fault data from a past release. The calibrated model is then applied to modules currently under development. Such models yield predictions on a module-by-module basis.
This paper examines the Classification And Regression Trees (CART) algorithm for building tree-based models that predict which software modules have high risk of faults to be discovered during operations. CART is attractive because it emphasizes pruning to achieve robust models. This paper presents details on the CART algorithm in the context of software engineering of embedded systems. We illustrate this approach with a case study of four consecutive releases of software embedded in a large telecommunications system. The level of accuracy achieved in the case study would be useful to developers of an embedded system. The case study indicated that this model would continue to be useful over several releases as the system evolves.
Reliable software is mandatory for complex mission-critical systems. Classifying modules as fault-prone, or not, is a valuable technique for guiding development processes, so that resources can be focused on those parts of a system that are most likely to have faults.
Logistic regression offers advantages over other classification modeling techniques, such as interpretable coefficients. There are few prior applications of logistic regression to software quality models in the literature, and none that we know of account for prior probabilities and costs of misclassification. A contribution of this paper is the application of prior probabilities and costs of misclassification to a logistic regression-based classification rule for a software quality model.
This paper also contributes an integrated method for using logistic regression in software quality modeling, including examples of how to interpret coefficients, how to use prior probabilities, and how to use costs of misclassifications. A case study of a major subsystem of a military, real-time system illustrates the techniques.
Software measurement and modeling is intended to improve quality by predicting quality factors, such as reliability, early in the life cycle. The field of software measurement generally assumes that attributes of software products early in the life cycle are somehow related to the amount of information in those products, and thus, are related to the quality that eventually results from the development process.
Kolmogorov complexity and information theory offer a way to quantify the amount of information in a finite object, such as a program, in a unifying framework. Based on these principles, we propose a new synthetic measure of information composed from a set of conventional primitive metrics in a module. Since not all information is equally relevant to fault-insertion, we also consider components of the overall information content. We present a model for fault-insertion based on a nonhomogeneous Poisson process and Poisson regression. This approach is attractive, because the underlying assumptions are appropriate for software quality data. This approach also gives insight into design attributes that affect fault insertion.
A validation case study of a large sample of modules from a very large telecommunications system provides empirical evidence that the components of synthetic module complexity can be useful in software quality modeling. A large telecommunications system is an example of a computer system with rigorous software quality requirements.
This paper presents a method proposal for estimation of software reliability before the implementation phase. The method is based upon that a formal specification technique is used and that it is possible to develop a tool performing dynamic analysis, i.e., locating semantic faults in the design. The analysis is performed with both applying a usage profile as input as well as doing a full analysis, i.e., locate all faults that the tool can find. The tool must provide failure data in terms of time since the last failure was detected. The mapping of the dynamic failures to the failures encountered during statistical usage testing and operation is discussed. The method can be applied either on the software specification or as a step in the development process by applying it on the software design. The proposed method allows for software reliability estimations that can be used both as a quality indicator, and for planning and controlling resources, development times, etc. at an early stage in the development of software systems.