This paper aims to show how to calculate different efficiency measures using a technology estimator defined through an adaptation of the Gradient Tree Boosting algorithm. This adaptation shares some features with the standard nonparametric Free Disposal Hull (FDH) approach, but it overcomes the overfitting problems associated with that approach. From a computational point of view, however, the new approach involves thousands of decision variables, making it difficult to solve. To tackle this problem, we also propose and evaluate a heuristic approximation to the exact measures. To demonstrate the applicability of the proposed method, the exact and heuristic approaches are compared through two empirical applications. The main contributions of this paper are as follows: we build a new bridge between machine learning techniques and technical efficiency measurement. In this framework, we show how to determine the output-oriented and input-oriented radial models, the Russell measure of output efficiency and the Russell measure of input efficiency, as well as the directional distance function and the Enhanced Russell Graph measure. We also prove that the new technique outperforms the standard FDH technique in terms of bias and mean squared error. Furthermore, we show that the new approach may be seen as a possible remedy for the curse-of-dimensionality problem.
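As a point of reference for the efficiency measures discussed above, the sketch below computes the standard output-oriented radial FDH score for a single evaluated unit. It is a minimal illustration of the benchmark estimator only, not of the boosting-based technology proposed in the paper, and the data arrays are hypothetical.

```python
import numpy as np

def fdh_output_efficiency(X, Y, x0, y0):
    """Output-oriented radial FDH efficiency of the unit (x0, y0).

    X, Y hold the observed inputs and outputs, one row per unit; the returned
    factor phi >= 1 is the largest proportional expansion of y0 attainable by
    an observed unit that uses no more of any input than x0.
    """
    dominating = np.all(X <= x0, axis=1)   # observed units feasible at input level x0
    ratios = Y[dominating] / y0            # output ratios against the evaluated unit
    return float(np.max(np.min(ratios, axis=1)))

# Hypothetical data: 4 units, 2 inputs, 2 outputs; evaluate the first unit.
X = np.array([[2.0, 3.0], [1.5, 2.5], [4.0, 1.0], [2.0, 2.0]])
Y = np.array([[4.0, 5.0], [5.0, 6.0], [3.0, 2.0], [6.0, 7.0]])
print(fdh_output_efficiency(X, Y, X[0], Y[0]))  # phi = 1.4 (the unit at index 3 dominates)
```

A score of 1 means no observed dominating unit can proportionally expand all outputs of the evaluated unit; larger scores indicate output inefficiency.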
Ensembles of several classifiers (such as neural networks or decision trees) are widely used to improve generalization performance over a single classifier. Proper diversity among component classifiers is considered an important factor in ensemble construction, so that the failure of one classifier may be compensated by the others. Among various approaches, data sampling, i.e., using different data sets for different classifiers, is found to be more effective than other approaches. A number of ensemble methods have been proposed under the umbrella of data sampling, of which some are restricted to neural networks or decision trees while others are applicable to both types of classifiers. We studied prominent data sampling techniques for neural network ensembles and then experimentally evaluated their effectiveness on a common test ground. The relation between generalization and diversity is presented in terms of the overlap and uncoverage of the sampled training sets. Eight ensemble methods were tested on 30 benchmark classification problems. We found that bagging and boosting, the pioneering ensemble methods, are still better than most of the other proposed methods. However, negative correlation learning, which implicitly encourages different networks toward different training spaces, is shown to be better than, or at least comparable to, bagging and boosting, which explicitly create different training spaces.
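To make the data-sampling idea concrete, here is a minimal bagging sketch in which each neural network is trained on a different bootstrap sample and the networks then vote. scikit-learn, the toy dataset, and all hyperparameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
nets = []
for _ in range(10):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))  # bootstrap sample: a different data set per network
    net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
    nets.append(net.fit(X_tr[idx], y_tr[idx]))

# Majority vote over the component networks (labels coded 0/1).
votes = np.stack([net.predict(X_te) for net in nets])
pred = np.round(votes.mean(axis=0)).astype(int)
print("bagged accuracy:", (pred == y_te).mean())
```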
The possible application of boosted neural networks to particle classification in high energy physics is discussed. A two-dimensional toy model, in which the boundary between signal and background is irregular but not overlapping, is constructed to show how the boosting technique works with neural networks. It is found that a boosted neural network not only decreases the classification error rate significantly but also increases the efficiency and the signal-to-background ratio. Moreover, the boosted neural network avoids the drawbacks of single neural network design. The boosted neural network is also applied to the classification of quark- and gluon-jet samples from Monte Carlo e+e- collisions, where the two samples overlap significantly. The performance of the boosting technique in the two boundary cases, with and without overlap, is discussed.
The paper describes an integrated recognition-by-parts architecture for reliable and robust face recognition. Reliability and robustness refer, respectively, to the ability to deploy full-fledged and operational biometric engines and to the handling of adverse image conditions that include, among others, uncooperative subjects, occlusion, and temporal variability. The proposed architecture is model-free and non-parametric. The conceptual framework draws support from discriminative methods using likelihood ratios. At the conceptual level it links forensics and biometrics, while at the implementation level it links the Bayesian framework and statistical learning theory (SLT). Layered categorization starts with face detection using implicit rather than explicit segmentation. It proceeds with face authentication, which involves feature selection of local patch instances including dimensionality reduction, exemplar-based clustering of patches into parts, and data fusion for matching using boosting driven by parts that play the role of weak learners. Face authentication shares the same implementation with face detection. The implementation, driven by transduction, employs proximity and typicality (ranking) realized using strangeness and p-values, respectively. The feasibility and reliability of the proposed architecture are illustrated using FRGC data. The paper concludes with suggestions for augmenting and enhancing the scope and utility of the proposed architecture.
A new approach for ensemble construction, based on restricting the set of example weights in the training data to avoid overfitting, is proposed in this paper. The algorithm, called EPIBoost (Extreme Points Imprecise Boost), applies imprecise statistical models to restrict the set of weights. The weights are updated within the restricted set by using its extreme points. The approach allows us to construct various algorithms by applying different imprecise statistical models to produce the restricted set. Numerical experiments with real data sets show that the EPIBoost algorithm may outperform standard AdaBoost for some parameter settings of the imprecise statistical models.
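The sketch below conveys the flavour of constraining boosting weights: a standard AdaBoost loop whose weight vector is projected into a simple box around the uniform distribution after every update. The box constraint is only an illustrative stand-in for the extreme-point construction used by EPIBoost, and the labels are assumed to be coded as -1/+1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def restricted_weight_boost(X, y, T=50, eps=0.3):
    """AdaBoost-style loop with sample weights kept inside a restricted set.

    The restriction is a simple box around the uniform distribution (an
    illustrative stand-in for an imprecise statistical model); y in {-1, +1}.
    """
    n = len(y)
    lo, hi = (1 - eps) / n, (1 - eps) / n + eps   # box constraints on each weight
    w = np.full(n, 1.0 / n)
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)
        if err >= 0.5 or err == 0:
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * pred)
        w = np.clip(w / w.sum(), lo, hi)          # project back into the restricted set
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, np.array(alphas)

def boost_predict(learners, alphas, X):
    score = sum(a * l.predict(X) for l, a in zip(learners, alphas))
    return np.sign(score)
```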
Many classification algorithms aim to minimize only their training error count; however, it is often desirable to minimize a more general cost metric in which distinct instances have different costs. In this paper, an instance-based, cost-sensitive, Bayesian-consistent version of the exponential loss function is proposed. Using the modified loss function, instance-based cost-sensitive extensions of AdaBoost, RealBoost and GentleBoost are derived, termed ICSAdaBoost, ICSRealBoost and ICSGentleBoost, respectively. In addition, a new instance-based cost generation method is proposed, avoiding the expensive process of having experts assign costs. Each sample thus takes two cost values: a class cost and a sample cost. The first is assigned equally to all samples of a class, while the second is generated according to the probability of each sample under its class probability density function. Experimental results show a 12% improvement in F-measure and a 13% improvement in cost-per-sample over a variety of UCI datasets compared to state-of-the-art methods. The significance of these improvements is supported by paired t-tests on the results.
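One possible reading of the two-part cost scheme is sketched below: every instance receives its class cost multiplied by a sample cost derived from a kernel density estimate within its own class. The Gaussian KDE, the normalisation, and the multiplicative combination are assumptions made for illustration; the paper's exact generation rule may differ.

```python
import numpy as np
from scipy.stats import gaussian_kde

def generate_instance_costs(X, y, class_costs):
    """Illustrative per-instance cost generation.

    Each sample receives a class cost (shared by its class, supplied in
    class_costs) scaled by a sample cost taken from the estimated density of
    the sample within its own class.
    """
    costs = np.empty(len(y))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        density = gaussian_kde(X[idx].T)(X[idx].T)  # within-class density at each sample
        sample_cost = density / density.max()       # normalise to (0, 1]
        costs[idx] = class_costs[c] * sample_cost   # combine class and sample cost
    return costs

# Example: binary problem where misclassifying class 1 is five times as costly.
# costs = generate_instance_costs(X, y, class_costs={0: 1.0, 1: 5.0})
# The costs can then seed the initial boosting weights, w0 = costs / costs.sum().
```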
Recently, the Factorization Machine (FM) has become increasingly popular for recommender systems due to its effectiveness in finding informative interactions between features. Usually, the weights for the interactions are learned as a low-rank weight matrix, formulated as the inner product of two low-rank matrices. This low-rank structure helps improve the generalization ability of the Factorization Machine. However, choosing the rank properly usually requires running the algorithm many times with different ranks, which is clearly inefficient for large-scale datasets. To alleviate this issue, we propose an Adaptive Boosting framework for Factorization Machines (AdaFM), which can adaptively search for proper ranks on different datasets without re-training. Instead of using a fixed rank, the proposed algorithm gradually increases the rank according to its performance until the performance no longer improves. Extensive experiments are conducted to validate the proposed method on multiple large-scale datasets. The experimental results demonstrate that the proposed method can be more effective than state-of-the-art Factorization Machines.
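For reference, the snippet below evaluates the second-order FM score whose pairwise interaction weights form the low-rank matrix V V^T mentioned above; the rank of V (its number of columns) is the quantity AdaFM is reported to grow adaptively. The toy dimensions and random parameters are purely illustrative.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM score: bias + linear term + pairwise interactions whose
    weight matrix is the low-rank product V @ V.T (rank = V.shape[1])."""
    pairwise = 0.5 * (np.sum((x @ V) ** 2) - np.sum((x ** 2) @ (V ** 2)))
    return w0 + x @ w + pairwise

# Toy usage: 5 features with rank-2 interaction factors.
rng = np.random.default_rng(0)
x = rng.random(5)
print(fm_predict(x, 0.1, rng.normal(size=5), rng.normal(size=(5, 2))))
```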
A boosting algorithm, based on the probably approximately correct (PAC) learning model, is used to construct an ensemble of neural networks that significantly improves performance (compared to a single network) in optical character recognition (OCR) problems. The effect of boosting is reported on four handwritten image databases: 12,000 digits from segmented ZIP Codes from the United States Postal Service and, from the National Institute of Standards and Technology, 220,000 digits, 45,000 upper-case letters, and 45,000 lower-case letters. We use two performance measures: the raw error rate (no rejects) and the reject rate required to achieve a 1% error rate on the patterns not rejected. Boosting improved performance significantly, and, in some cases, dramatically.
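The second measure can be computed generically as follows: sort the test patterns by classifier confidence and find the smallest rejection fraction for which the accepted patterns reach the target error rate. How confidence is defined here is an assumption for the sketch, not something taken from the paper.

```python
import numpy as np

def reject_rate_for_target_error(confidence, correct, target=0.01):
    """Smallest fraction of patterns to reject (least-confident first) so that
    the error rate on the accepted patterns is at most `target`."""
    order = np.argsort(confidence)               # least confident first
    correct = np.asarray(correct, float)[order]
    n = len(correct)
    for r in range(n + 1):                       # reject the r least-confident patterns
        accepted = correct[r:]
        if len(accepted) == 0 or 1.0 - accepted.mean() <= target:
            return r / n
    return 1.0

# Toy usage: confidences in [0, 1], with more confident patterns more often correct.
rng = np.random.default_rng(0)
conf = rng.random(1000)
corr = rng.random(1000) < 0.5 + 0.5 * conf
print(reject_rate_for_target_error(conf, corr))
```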
Class imbalance is a fundamental problem in data mining and knowledge discovery that is encountered in a wide array of application domains. Random undersampling has been widely used to alleviate the harmful effects of imbalance; however, this technique often leads to a substantial amount of information loss. Repetitive undersampling techniques, which generate an ensemble of models, each trained on a different undersampled subset of the training data, have been proposed to alleviate this difficulty. This work reviews three repetitive undersampling methods currently used to handle imbalance and presents a detailed and comprehensive empirical study using four different learners, four performance metrics and 15 datasets from various application domains. To our knowledge, this work is the most thorough study of repetitive undersampling techniques.
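A minimal sketch of the repetitive-undersampling idea is shown below: every ensemble member is trained on the full minority class plus a fresh random sample of the majority class, and the members' probability estimates are averaged. The 1:1 sampling ratio, the decision-tree base learner, and the 0/1 class coding are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def undersample_ensemble(X, y, n_models=10, seed=0):
    """Each member sees the full minority class (coded 1) plus a different
    random undersampling of the majority class (coded 0), balanced 1:1."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_models):
        sampled = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sampled])
        models.append(DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx]))
    return models

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
models = undersample_ensemble(X, y)
scores = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)  # averaged positive-class scores
```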
Software defect prediction models that use software metrics such as code-level measurements and defect data to build classification models are useful tools for identifying potentially problematic program modules. The effectiveness of detecting such modules is affected by the software measurements used, making data preprocessing an important step during software quality prediction. Generally, two problems affect software measurement data: high dimensionality (where a training dataset has an extremely large number of independent attributes, or features) and class imbalance (where a training dataset has one class with many more members than the other class). In this paper, we present a novel form of ensemble learning based on boosting that incorporates data sampling to alleviate class imbalance and feature (software metric) selection to address high dimensionality. As we adopt two different sampling methods, Random Undersampling (RUS) and the Synthetic Minority Oversampling Technique (SMOTE), we have two forms of our new ensemble-based approach: selectRUSBoost and selectSMOTEBoost. To evaluate the effectiveness of these new techniques, we apply them to two groups of datasets from two real-world software systems. In the experiments, four learners and nine feature selection techniques are employed to build our models. We also consider versions of the technique that do not incorporate feature selection and compare all four techniques (the two ensemble-based approaches that utilize feature selection and the two versions that use sampling only). The experimental results demonstrate that selectRUSBoost is generally more effective in improving defect prediction performance than selectSMOTEBoost, and that the techniques with feature selection yield better predictions than those without.
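A rough, off-the-shelf approximation of the selectRUSBoost idea can be assembled from existing components, as sketched below: a feature-selection step followed by boosting with random undersampling. scikit-learn and imbalanced-learn are used as stand-ins, with feature selection applied once up front rather than exactly as in the paper's algorithm; the toy data and the choice of k are assumptions.

```python
from imblearn.ensemble import RUSBoostClassifier
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline

# Imbalanced toy data standing in for software-metric features.
X, y = make_classification(n_samples=2000, n_features=40, weights=[0.95, 0.05], random_state=0)

model = make_pipeline(
    SelectKBest(f_classif, k=10),        # address high dimensionality (metric selection)
    RUSBoostClassifier(random_state=0),  # address class imbalance inside boosting via RUS
)
model.fit(X, y)
```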
Defect prediction is very challenging in software development practice. Classification models are useful tools that can help with such prediction. These models can classify program modules into quality-based classes, e.g., fault-prone (fp) or not fault-prone (nfp), which facilitates the allocation of limited project resources. For example, more resources are assigned to program modules that are of poor quality or likely, based on the classification, to have a high number of faults. However, two main problems, high dimensionality and class imbalance, affect the quality of training datasets and therefore of classification models. Feature selection and data sampling are often used to overcome these problems. Feature selection is the process of choosing the most important attributes from the original dataset. Data sampling alters the dataset to change its balance level. Another technique, called boosting (building multiple models, with each model tuned to work better on instances misclassified by previous models), is also found to be effective for addressing the class imbalance problem.
In this study, we investigate an approach for combining feature selection with this ensemble learning (boosting) process. We focus on two different scenarios: feature selection performed prior to the boosting process and feature selection performed inside the boosting process. Ten individual base feature ranking techniques, as well as an ensemble ranker based on the ten, are examined and compared over the two scenarios. We also employ the boosting algorithm to construct classification models without performing feature selection and use the results as a baseline for further comparison. The experimental results demonstrate that feature selection is important and needed prior to the learning process. In addition, the ensemble feature ranking method generally performs better than or similarly to the average of the base ranking techniques and, more importantly, exhibits better robustness than most base ranking techniques. As for the two scenarios, the results show that applying feature selection inside boosting performs better than using feature selection prior to boosting.
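The "inside boosting" scenario can be pictured as follows: at every boosting round the data are resampled according to the current weights, the features are re-ranked on that sample, and the weak learner is fitted on the selected subset. The ranking criterion (ANOVA F-score), the stump base learner, and the -1/+1 label coding are assumptions for this sketch, not the paper's exact procedure.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

def boost_with_inner_selection(X, y, T=30, k=10, seed=0):
    """Feature selection inside boosting: re-rank features each round on a
    weight-driven resample, then fit a stump on the top-k features; y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)
    rounds = []
    for _ in range(T):
        idx = rng.choice(n, size=n, p=w)                       # resample by boosting weights
        selector = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
        stump = DecisionTreeClassifier(max_depth=1).fit(selector.transform(X), y, sample_weight=w)
        pred = stump.predict(selector.transform(X))
        err = float(np.dot(w, pred != y))
        if err >= 0.5 or err == 0:
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        rounds.append((selector, stump, alpha))
    return rounds

def inner_selection_predict(rounds, X):
    score = sum(a * s.predict(sel.transform(X)) for sel, s, a in rounds)
    return np.sign(score)
```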
The noise intolerance and the storage requirements of nearest-neighbor-based algorithms are the two main obstacles to their use for solving complex classification tasks. Since the beginning of the 1970s (with Hart and Gates), many methods have been proposed to deal with these problems by eliminating mislabeled instances and selecting relevant prototypes. These models often have the distinctive feature of optimizing accuracy during the process. In this paper, we present a new approach that adapts the properties of boosting (which optimizes another criterion) to the prototype selection field. While in a standard boosting algorithm the final classifier combines a set of weak hypotheses, each of which is a classifier built according to a given distribution over the training data, in our approach each weak hypothesis is defined as a single weighted prototype. The distribution update (a key step of boosting) and the criterion optimized during the process are slightly modified to allow an efficient adaptation of boosting to prototype selection.
To demonstrate the merits of our new algorithm, called PSBOOST, we carried out a broad experimental study comparing our procedure with state-of-the-art prototype selection algorithms. Taking into account many performance measures, such as storage reduction, noise tolerance, generalization accuracy and learning speed, we can claim that PSBOOST is very effective, providing a good balance among all these performance measures. A statistical analysis is presented to validate the results.
This research presents a new learning model, the Parallel Decision DAG (PDDAG), and shows how to use it to represent an ensemble of decision trees while using significantly less storage. Ensembles such as bagging and boosting have a high probability of encoding redundant data structures, and PDDAGs provide a way to remove this redundancy in decision-tree-based ensembles. When trained by encoding an ensemble, the new model behaves similarly to the original ensemble and can be made to perform identically to it. The reduced storage requirements allow an ensemble approach to be used in cases where storage limits would otherwise be exceeded, and the smaller model can potentially execute faster by reducing redundant computation.
We consider a variation of the problem of combining expert opinions for the situation in which there is no ground truth to use for training. Even though we do not have labeled data, the goal of this work is quite different from that of an unsupervised learning problem, where the goal is to cluster the data. Our work is motivated by the application of segmenting a lung nodule in a computed tomography (CT) scan of the human chest. The lack of a gold standard of truth is a critical problem in medical imaging. A variety of experts, both human and computer algorithms, are available that can mark which voxels are part of a nodule. The question is how to combine these expert opinions to estimate the unknown ground truth. We present the Veritas algorithm, which predicts the underlying label using the knowledge in the expert opinions even without the benefit of any labeled data for training. We evaluate Veritas using artificial data and real CT images to which synthetic nodules have been added, providing a known ground truth.
We present a new ensemble learning method that employs a set of regional classifiers, each of which learns to handle a subset of the training data. We split the training data and generate classifiers for different regions in the feature space. When classifying an instance, we apply a weighted voting scheme among the classifiers that include the instance in their region. We used 11 datasets to compare the performance of our new ensemble method with that of single classifiers as well as other ensemble methods such as RBE, bagging and AdaBoost. We found that the performance of our method is comparable to that of AdaBoost and bagging when the base learner is C4.5; in the remaining cases, our method outperformed the other approaches.
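One way to picture the regional scheme is sketched below: the feature space is partitioned with k-means, one classifier is fitted per region, and test instances receive a distance-weighted vote over the regional classifiers. The k-means partition, the inverse-distance weights, the decision-tree base learner, and the binary class coding are simplifying assumptions, not the paper's exact membership rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def train_regional(X, y, n_regions=5, seed=0):
    """Split the feature space with k-means and fit one classifier per region."""
    km = KMeans(n_clusters=n_regions, n_init=10, random_state=seed).fit(X)
    clfs = [DecisionTreeClassifier(random_state=seed).fit(X[km.labels_ == r], y[km.labels_ == r])
            for r in range(n_regions)]
    return km, clfs

def predict_regional(km, clfs, X, classes=(0, 1)):
    """Distance-weighted vote among the regional classifiers."""
    dists = km.transform(X)               # distance of each instance to each region centre
    weights = 1.0 / (dists + 1e-9)
    votes = np.zeros((len(X), len(classes)))
    for r, clf in enumerate(clfs):
        preds = clf.predict(X)
        for c_idx, c in enumerate(classes):
            votes[:, c_idx] += weights[:, r] * (preds == c)
    return np.asarray(classes)[votes.argmax(axis=1)]

X, y = make_classification(n_samples=1000, random_state=0)
km, clfs = train_regional(X, y)
print((predict_regional(km, clfs, X) == y).mean())
```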
This article introduces a novel ensemble method named eAdaBoost (Effective Adaptive Boosting), a meta-classifier developed by enhancing the existing AdaBoost algorithm to reduce time complexity and improve classification accuracy. eAdaBoost reduces the error rate compared with existing methods and improves accuracy by reweighting each feature for further processing. The proposed method is evaluated in an extensive experimental study on datasets from the UCI machine learning repository. Classifier accuracy and statistical test comparisons are made against various boosting algorithms. The proposed eAdaBoost has also been implemented with different decision tree classifiers, namely C4.5, Decision Stump, NBTree and Random Forest. The algorithm has been run on various datasets with different weight thresholds, and its performance is analyzed. For some datasets, the proposed method produces better results using Random Forest and NBTree as base classifiers than with Decision Stump and C4.5. eAdaBoost gives better classification and prediction accuracy, and its execution time is also lower than that of the other classifiers.
An extension of the AdaBoost algorithm for obtaining fuzzy rule-based systems from low-quality data is combined with preprocessing algorithms for equalizing imbalanced datasets. With the help of synthetic and real-world problems, it is shown that the performance of the AdaBoost algorithm is degraded in the presence of moderate uncertainty in either the input or the output values. It is also established that a preprocessing stage improves the accuracy of the classifier in a wide range of binary classification problems, including those whose imbalance ratio is uncertain.
Symbolically representing the knowledge acquired by a neural network aims at illuminating the latent information embedded within the network. The literature offers many algorithms dedicated to extracting symbolic classification rules from neural networks. While some excel at producing highly accurate rules, others specialize in generating rules that are easily comprehensible. Only a few algorithms, however, manage to strike a balance between comprehensibility and accuracy. One such technique is the Rule Extraction from Neural Network Using Classified and Misclassified Data (RxNCM) algorithm, which generates straightforward and precise rules outlining input data ranges with commendable accuracy. This article aims to enhance the classification performance of the RxNCM algorithm by leveraging an ensemble technique. Ensembles focus on augmenting classifier performance by harnessing the strengths of individual classifiers. Rule extraction through neural network ensembles is relatively underexplored; this paper bridges the gap by introducing the Rule extraction using Neural Network Ensembles (RENNE) algorithm. RENNE is designed to refine the classification rules derived from the RxNCM algorithm through an ensemble strategy. Specifically, RENNE leverages patterns correctly predicted by an ensemble of neural networks during the rule generation process. The efficacy of the algorithm is validated using seven datasets sourced from the UCI repository. The outcomes indicate that the proposed RENNE algorithm outperforms the RxNCM algorithm.
In this paper, we investigate the impact of Random Undersampling (RUS) on a supervised machine learning task involving highly imbalanced big data. We present the results of experiments in Medicare fraud detection. To the best of our knowledge, these experiments are conducted with the largest insurance claims datasets ever used for Medicare fraud detection. We obtain two datasets from two Big Data repositories provided by the United States government's Centers for Medicare and Medicaid Services. The larger of the two datasets contains nearly 174 million instances, with a minority-to-majority class ratio of approximately 0.0039. Our contribution is to show that RUS has a detrimental effect on a Medicare fraud detection task when performed on large-scale, imbalanced data. The effect of RUS is apparent in the Area Under the Precision-Recall Curve (AUPRC) scores recorded from the experimental outcomes. We use four popular, open-source classifiers in our experiments to confirm the negative impact of RUS on their AUPRC scores.
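The kind of comparison reported above can be reproduced in miniature as follows: train the same classifier on the full imbalanced training set and on a 1:1 undersampled version, then compare AUPRC (average precision) on an untouched test set. The toy data, the logistic regression learner, and the 1:1 sampling ratio are assumptions; the point is only to show where RUS enters the evaluation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Highly imbalanced toy data (the Medicare data itself is far larger and rarer).
X, y = make_classification(n_samples=100_000, weights=[0.996, 0.004], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def auprc(clf, X_fit, y_fit):
    clf.fit(X_fit, y_fit)
    return average_precision_score(y_te, clf.predict_proba(X_te)[:, 1])

# Full training data versus 1:1 random undersampling of the majority class.
rng = np.random.default_rng(0)
pos, neg = np.flatnonzero(y_tr == 1), np.flatnonzero(y_tr == 0)
rus = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])

print("no RUS  :", auprc(LogisticRegression(max_iter=1000), X_tr, y_tr))
print("with RUS:", auprc(LogisticRegression(max_iter=1000), X_tr[rus], y_tr[rus]))
```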
Boosting, one of the best off-the-shelf classification methods, has evoked widespread interest in machine learning and statistics. However, the original algorithm was developed for binary classification problems. In this paper, we study multi-class boosting algorithms under the ℓ2-loss framework and devise two multi-class ℓ2-Boost algorithms, based on coordinate descent and gradient descent respectively, to minimize the multi-class ℓ2-loss function. We derive a coding scheme using optimal scoring constraints to encode class labels and a simple decoder to recover the true class labels. Our boosting algorithms are easily implemented and their results converge to the global optimum. Experiments with synthetic and real-world datasets show that, compared with several state-of-the-art methods, our algorithms provide more accurate results.
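A bare-bones picture of multi-class ℓ2-boosting is given below: the labels are coded as vectors, and each round fits small regression trees to the current residuals, which amounts to functional gradient descent on the ℓ2-loss. Plain one-hot coding and an argmax decoder stand in for the optimal-scoring code and decoder described in the paper, and all hyperparameters are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor

def l2_boost_multiclass(X, y, T=100, shrinkage=0.1, depth=2):
    """L2-Boosting for K classes: code the labels as vectors and repeatedly fit
    small regression trees to the residuals of the current fit."""
    classes, y_idx = np.unique(y, return_inverse=True)
    Y = np.eye(len(classes))[y_idx]          # one-hot label code
    F = np.zeros_like(Y)                     # current fitted values
    trees = []
    for _ in range(T):
        residual = Y - F
        round_trees = []
        for k in range(len(classes)):
            t = DecisionTreeRegressor(max_depth=depth).fit(X, residual[:, k])
            F[:, k] += shrinkage * t.predict(X)
            round_trees.append(t)
        trees.append(round_trees)
    return classes, trees

def l2_boost_predict(classes, trees, X, shrinkage=0.1):
    F = np.zeros((len(X), len(classes)))
    for round_trees in trees:
        for k, t in enumerate(round_trees):
            F[:, k] += shrinkage * t.predict(X)
    return classes[F.argmax(axis=1)]         # simple argmax decoder

X, y = load_iris(return_X_y=True)
classes, trees = l2_boost_multiclass(X, y, T=50)
print((l2_boost_predict(classes, trees, X) == y).mean())
```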