Despite the considerable advances achieved by convolutional neural networks (CNNs) in active learning, simultaneously considering multiple criteria, such as uncertainty and representativeness, when sampling informative images remains challenging. In this work, we develop and evaluate an active learning method called Exploring Uncertainty and Representativeness in Deep Active Learning (EURDAL), which iteratively boosts the generalization capacity of the CNN backbone for image classification. Specifically, we append two adversarial image classifiers to the CNN backbone, and uncertainty is measured by the prediction discrepancy between the two adversarial classifiers. In addition, the (n+1)-tuplet loss, which was developed in deep metric learning, is imposed on the CNN backbone to force it to extract discriminative features from unlabeled images. Representativeness is then defined by the ratio of the distance between a given example and its own centroid to the distances between that example and all cluster centroids. By integrating the learned uncertainty and representativeness, the most informative images are selected and noisy ones are suppressed at each iteration of active supervision. Extensive image classification experiments on three benchmark datasets show that the proposed EURDAL network outperforms several competitive active learning methods.
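The two sampling criteria described above can be illustrated with a minimal sketch. It is not the EURDAL implementation: the function names, the use of the mean absolute difference as the discrepancy, the choice of the nearest centroid as an example's "own" centroid, and the product used to combine the two scores are all assumptions made for illustration.

```python
import numpy as np

def discrepancy_uncertainty(p1, p2):
    """Mean absolute difference between two classifiers' class-probability rows."""
    return np.abs(np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float)).mean(axis=1)

def distance_ratio(features, centroids, eps=1e-12):
    """Distance to the nearest centroid divided by the summed distance to all centroids.

    Assumption: an example's "own" centroid is taken to be its nearest one.
    A small ratio means the example sits close to its cluster centre.
    """
    features = np.asarray(features, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1) / (dists.sum(axis=1) + eps)

def select_batch(p1, p2, features, centroids, k):
    """Return indices of the k examples with the highest combined score.

    Assumed combination: uncertainty multiplied by (1 - distance ratio).
    """
    score = discrepancy_uncertainty(p1, p2) * (1.0 - distance_ratio(features, centroids))
    return np.argsort(-score)[:k]
```

Under this scoring, an example on which the two classifiers disagree strongly but which lies far from every centroid is down-weighted, which is one plausible way to suppress noisy samples.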
In recent years, the number of machine learning algorithms and their parameters has increased significantly. On the one hand, this increases the chances of finding better models. On the other hand, it makes training a model more complex, as the search space expands significantly. As datasets also grow, traditional approaches based on extensive search become prohibitively expensive in terms of computational resources and time, especially in data streaming scenarios. This paper describes an approach based on meta-learning that tackles two main challenges. The first is to predict key performance indicators of machine learning models. The second is to recommend the best algorithm/configuration for training a model for a given machine learning problem. When compared to a state-of-the-art method (AutoML), the proposed approach is up to 130x faster and only 4% worse in terms of average model quality. Hence, it is especially suited to scenarios in which models need to be updated regularly, such as streaming scenarios with big data, in which some accuracy can be traded for a much shorter model training time.
Web document ranking arises in many information retrieval (IR) applications, such as search engines, recommendation systems, and online advertising. A challenging issue is how to select representative query-document pairs and informative features so that a ranking model can be learned that produces an acceptable ranked list of candidate documents for each query. In this study, we propose a ranking model based on active sampling (AS) and kernel principal component analysis (KPCA), viz. AS-KPCA Regression, to study document ranking for a retrieval system, i.e., how to choose representative query-document pairs and features for learning. More precisely, AS gradually adds to the training set those documents that would incur the highest expected DCG loss if left unselected. KPCA is then performed by projecting the selected query-document pairs onto p principal components in the feature space to complete the regression. In this way, we reduce the computational overhead and suppress the impact of noise at the same time. To the best of our knowledge, we are the first to perform document ranking with simultaneous reduction along two dimensions, namely the number of documents and the number of features. Our experiments on the public LETOR 4.0 datasets show that our approach outperforms the baseline methods. It improves on RankBoost and the other baselines by nearly 20% in terms of MAP, with smaller improvements in P@K and NDCG@K. Moreover, our approach is particularly suitable for document ranking on noisy datasets in practice.
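For readers unfamiliar with the DCG and NDCG@K metrics mentioned above, the following sketch shows one common formulation, using the exponential gain 2^rel - 1 and a log2 rank discount. The abstract does not state which DCG variant the authors use, so the gain function and the helper names here are assumptions.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items of a ranked relevance list."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so the discount is log2(position + 1)
        for rank, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that already lists documents in descending relevance has an NDCG of 1; moving a relevant document down the list lowers the score, which is the loss an active sampler of this kind would try to anticipate.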
High-fidelity beamline models typically involve particle tracking and particle–matter interaction, which are computationally demanding and time-consuming. This has led researchers to adopt data-driven surrogate models as an alternative to complex physics simulations. Training a data-driven model requires a large amount of labeled data, prompting researchers to simulate diverse beamline settings and obtain the corresponding labels. However, the required dataset grows increasingly large as the number of adjustable parameters increases. Therefore, this study proposes a data-efficient surrogate modeling method that employs a student–teacher framework combined with active learning (AL) query strategies to minimize the number of labeled samples while ensuring model accuracy. The proposed method is evaluated on the energy selection system (ESS) design of the Huazhong University of Science and Technology proton therapy facility (HUST-PTF). The results show that: (i) training the surrogate model on 684 labeled samples selected via the query strategy achieves a relative error below 5% for 90% of the samples in the test set; (ii) compared to the beamline model built with Beam Delivery Simulation (BDSIM), the computational efficiency of the surrogate model is improved by a factor of 𝒪(10^7).
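The accuracy criterion quoted above (relative error below 5% for 90% of test samples) can be checked with a few lines of code. This is a generic sketch, not the authors' evaluation script; the function name is hypothetical and the reference values are assumed to be non-zero.

```python
import numpy as np

def fraction_within_relative_error(y_true, y_pred, tol=0.05):
    """Fraction of samples whose relative error |pred - true| / |true| is below tol.

    Assumes every entry of y_true is non-zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rel_err = np.abs(y_pred - y_true) / np.abs(y_true)
    return float(np.mean(rel_err < tol))
```

A surrogate would meet the stated criterion when this function returns at least 0.9 on the held-out test set with `tol=0.05`.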
Semisupervised community detection algorithms use prior knowledge to improve the discovery of the community structure of a complex network. However, obtaining such prior knowledge is expensive and time-consuming in many real-world applications. This paper proposes an active semisupervised community detection algorithm based on the similarities between nodes. First, it transforms a given complex network into a weighted directed network using the proposed asymmetric similarity method, and selects some informative nodes to be labeled by means of an active mechanism. Second, the proposed algorithm discovers the community structure of the network by propagating the community labels of the labeled nodes to their neighbors based on the similarity between a node and a community. Finally, the proposed algorithm is evaluated on three real networks and one synthetic network, and the experimental results show that it performs better than several other community detection algorithms.
In semi-supervised learning, when the number of data samples with class label information is very small, information from unlabeled data is utilized in the learning process. Many semi-supervised learning methods have been presented and have exhibited competent performance. Active learning also aims to overcome the shortage of labeled data by obtaining class labels for some selected unlabeled data from experts. However, the selection process for the most informative unlabeled data samples can be demanding when the search is performed over a large set of unlabeled data. In this paper, we propose a method for batch mode active learning in graph-based semi-supervised learning. Instead of acquiring class label information of one unlabeled data sample at a time, we obtain information about several data samples at once, reducing time complexity while preserving the beneficial effects of active learning. Experimental results demonstrate the improved performance of the proposed method.
Multi-label active learning for image classification has been a popular research topic. It faces several challenges, even though related work has made great progress. Existing studies on multi-label active learning do not pay attention to the cleanness of sample data. In reality, data are easily polluted by external influences that are likely to disturb the exploration of data space and have a negative effect on model training. Previous methods of label correlation mining, which are purely based on observed label distribution, are defective. Apart from neglecting noise influence, they also cannot acquire sufficient relevant information. In fact, they neglect inner relation mapping from example space to label space, which is an implicit way of modeling label relationships. To solve these issues, we develop a novel multi-label active learning with low-rank application (ENMAL) algorithm in this paper. A low-rank model is constructed to quantize noise level, and the example-label pairs that contain less noise are emphasized when sampling. A low-rank mapping matrix is learned to signify the mapping relation of a multi-label domain to capture a more comprehensive and reasonable label correlation. Integrating label correlation with uncertainty and considering sample noise, an efficient sampling strategy is developed. We extend ENMAL with automatic labeling (denoted as AL-ENMAL) to further reduce the annotation workload of active learning. Empirical research demonstrates the efficacy of our approaches.
A good trimap is essential for a high-quality alpha matte. However, making a high-quality trimap is hard work, especially for complex images. In this paper, an active learning framework is proposed for making high-quality trimaps. Two active learning methods are employed: minimization of uncertainty sampling (MUS) and maximization of expected model output change (EMOC). The MUS model finds the informative areas of the image that can decrease the uncertainty of the alpha matte. The EMOC model finds the important areas of the image that give the maximum expected output change of the alpha matte. The two methods are combined to define the active map, which shows the informative areas of the image and can help users make a high-quality trimap. Analysis and evaluation on benchmark datasets show that the proposed method is effective.
Cancer prediction from gene expression data is a very challenging area of research in computational biology and bioinformatics. Conventional classifiers are often unable to achieve the desired accuracy due to the lack of ‘sufficient’ training patterns in terms of clinically labeled samples. Active learning can be useful in this respect, as it automatically finds only a few of the most informative (or confusing) samples, obtains their class labels from experts, and adds them to the training set, which can in turn improve prediction accuracy. A novel active learning technique using a fuzzy-rough nearest neighbor classifier (ALFRNN) is proposed in this paper for cancer classification from microarray gene expression data. The proposed ALFRNN method is capable of dealing with the uncertainty, overlap, and indiscernibility often present in cancer subtypes (classes) of gene expression data. The proposed method is tested on different real-life microarray gene expression cancer datasets and compared with five other state-of-the-art techniques (three active learning-based and two traditional classification methods) in terms of percentage accuracy, precision, recall, F1-measure, and kappa. The experimental results establish the superiority of the proposed method over the competing algorithms for cancer prediction, and paired t-tests confirm the statistical significance of the results in favor of the proposed method for almost all the datasets.
Case-Base Maintenance (CBM) becomes of great importance when implementing a Computer-Aided Diagnostic (CAD) system using Case-Based Reasoning (CBR). Since it is essential for the learning to avoid the case-base degradation, this work aims to build and maintain a quality case base while overcoming the difficulty of assembling labeled case bases, traditionally assumed to exist or determined by human experts. The proposed approach takes advantage of large volumes of unlabeled data to select valuable cases to add to the case base while monitoring retention to avoid performance degradation and to build a compact quality case base. We use machine learning techniques to cope with this challenge: an Active Semi-Supervised Learning approach is proposed to overcome the bottleneck of scarcity of labeled data. In order to acquire a quality case base, we target its performance criterion. Case selection and retention are assessed according to three combined sampling criteria: informativeness, representativeness, and diversity. We support our approach with empirical evaluations using different benchmark data sets. Based on experimentation, the proposed approach achieves good classification accuracy with a small number of retained cases, using a small training set as a case base.
To address the difficulty of obtaining large numbers of class-labeled samples in supervised learning and to reduce the cost of data labeling, a fast support vector optimization method based on convex hull vectors is proposed, building on the learning mechanism of the Support Vector Machine (SVM). The method computes the hull vectors of the sample set, labels those hull vectors that are most likely to be support vectors, and adds the unlabeled samples that the classifiers predict with high confidence to the training sample set. In this way, the information in the unlabeled sample set that is beneficial to the learner is exploited. In addition, the thick convex hull method and a modified weighted SVM are introduced for the nonlinearly separable problem and the unbalanced training sample set, respectively. Experiments on the UCI data set demonstrate that the algorithm yields SVM classifiers with higher classification accuracy and better generalization performance using fewer labeled samples, thereby reducing the labeling cost of SVM training.
Network anomalies significantly impact the efficiency and stability of network systems, making effective anomaly detection crucial for optimal performance and the prevention of network breakdowns. However, conventional methods must be improved to handle the complexity and evolving nature of anomalies. Despite extensive research on network anomaly detection (NAD) techniques, there is a need for more systematic literature reviews that incorporate recent advances, particularly in dynamic and heterogeneous network settings. Moreover, most review papers focus on individual detection methods and lack a unified framework for comprehensive anomaly detection. To bridge these gaps, this paper conducts a systematic literature review and formulates five research questions to outline the objectives of the study. A holistic framework is proposed that integrates preprocessing and feature selection techniques into prediction models to develop more accurate, efficient, and reliable anomaly detection systems. The empirical evaluation assesses the effectiveness, accuracy, efficiency, and reliability of data-driven NAD techniques. Finally, the study identifies research gaps and potential future directions to guide further advances in developing accurate and efficient anomaly detection models. By synthesizing and analyzing 116 top-cited papers, this study contributes to the existing body of knowledge by highlighting the potential of emerging anomaly detection techniques in complex and dynamic network environments.
Fault diagnosis in the real world is often complicated because not all relevant fault information is directly available. In many fault diagnosis situations, it is impossible or inconvenient to find all fault information before establishing a fault diagnosis model. To deal with this issue, a method named active example selection (AES) is proposed for fault diagnosis. AES can actively discover unseen faults and choose useful samples to improve fault detection accuracy. AES consists of three key components: (1) a fusion model that combines the advantages of unsupervised and supervised fault diagnosis methods, where the unsupervised methods can discover unseen faults and the supervised methods provide better fault detection accuracy on seen faults, (2) an active learning algorithm that helps the supervised fault diagnosis methods actively discover unseen faults and choose useful samples to improve fault detection accuracy, and (3) an incremental learning scheme that speeds up the iterative training procedure of AES. The proposed method was evaluated on the benchmark Tennessee Eastman Process data. It performed better on both unseen and seen faults than the stand-alone unsupervised and supervised fault diagnosis methods, their combination, and the reference support vector machines based on active learning.
We present a freely available named-entity recognizer for Greek texts that identifies temporal expressions, person, and organization names. For temporal expressions, it relies on semi-automatically produced patterns. For person and organization names, it employs an ensemble of Support Vector Machines that scan the input text in two passes. The ensemble is trained using active learning, whereby the system itself proposes candidate training instances to be annotated by a human during training. The recognizer was evaluated on both a general collection of newspaper articles and a more focussed, in terms of topics, collection of financial articles.
Pattern mining provides useful tools for exploratory data analysis. Numerous efficient algorithms exist that are able to discover various types of patterns in large datasets. Unfortunately, the problem of identifying patterns that are genuinely interesting to a particular user remains challenging. Current approaches generally require considerable data mining expertise or effort from the data analyst, and hence cannot be used by typical domain experts.
To address this, we introduce a generic framework for interactive learning of user-specific pattern ranking functions. The user is only asked to rank small sets of patterns, while a ranking function is inferred from this feedback by preference learning techniques. Moreover, we propose a number of active learning heuristics to minimize the effort required from the user, while ensuring that accurate rankings are obtained. We show how the learned ranking functions can be used to mine new, more interesting patterns.
We demonstrate two concrete instances of our framework for two different pattern mining tasks, frequent itemset mining and subgroup discovery. We empirically evaluate the capacity of the algorithm to learn pattern rankings by emulating users. Experiments demonstrate that the system is able to learn accurate rankings, and that the active learning heuristics help reduce the required user effort. Furthermore, using the learned ranking functions as search heuristics allows discovering patterns of higher quality than those in the initial set. This shows that machine learning techniques in general, and active preference learning in particular, are promising building blocks for interactive data mining systems.
Active learning is an important approach to reducing data-collection costs for inductive learning problems by sampling only the most informative instances for labeling. We focus here on the sampling criterion used to select these most informative instances. Three contributions are made in this paper. First, in contrast to the leading sampling strategy of halving the volume of the version space, we present a sampling strategy that reduces the volume of the version space by more than half, under the assumption that the target function is chosen from a nonuniform distribution over the version space. Second, we propose the idea of sampling the instances that are most likely to be misclassified. Third, we develop a sampling method named CBMPMS (Committee Based Most Possible Misclassification Sampling), which samples the instances that have the largest probability of being misclassified by the current classifier. Compared with existing active learning methods, the proposed CBMPMS method needs fewer samples for the classifiers to achieve the same accuracy. The experiments show that the proposed method outperforms the traditional sampling methods on most of the selected datasets.
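The idea of querying the instances most likely to be misclassified can be sketched with a simple committee-vote proxy. This is not the CBMPMS algorithm itself, whose probability estimate is not given in the abstract: here the misclassification probability is approximated, as an assumption, by the fraction of committee members that disagree with the majority vote, and both function names are hypothetical.

```python
import numpy as np

def misclassification_estimate(votes):
    """Per-sample fraction of committee members that disagree with the majority label.

    votes: array of shape (n_members, n_samples) holding non-negative integer labels.
    """
    votes = np.asarray(votes)
    n_members, n_samples = votes.shape
    estimate = np.empty(n_samples)
    for j in range(n_samples):
        counts = np.bincount(votes[:, j])  # votes per label for sample j
        estimate[j] = 1.0 - counts.max() / n_members
    return estimate

def query_indices(votes, k):
    """Indices of the k samples with the highest estimated misclassification probability."""
    return np.argsort(-misclassification_estimate(votes), kind="stable")[:k]
```

Samples on which the committee is unanimous receive an estimate of zero and are queried last, so labeling effort concentrates where the current hypotheses conflict.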
Intrusion detection systems play an important role in computer security. To make intrusion detection systems adaptive to changing environments, supervised learning techniques have been applied to intrusion detection. However, supervised learning needs a large number of training instances to obtain classifiers with high accuracy. Because high-quality labeled instances are scarce, some researchers have turned to semi-supervised learning, which uses unlabeled instances to enhance classification. But involving unlabeled instances in the learning process also introduces a vulnerability: attackers can generate fake unlabeled instances to mislead the final classifier so that some intrusions cannot be detected. In this paper, we show that an attacker can mislead a semi-supervised intrusion detection classifier by poisoning the unlabeled instances, and we propose a defense method based on active learning to defeat the poisoning attack. Experiments show that the poisoning attack can reduce the accuracy of the semi-supervised learning classifier, and that the proposed active learning-based defense obtains higher accuracy than the original semi-supervised learner under the presented poisoning attack.
Action model learning can relieve people from writing planning domain descriptions from scratch. Real-world learners need to be sensitive to all kinds of costs incurred during learning. However, most previous studies in this line of research considered only the running time as the learning cost. In real-world applications, carrying out actions and obtaining observations incur additional costs, particularly in online learning, where they are the dominant expense. A learning algorithm should therefore reduce the total cost while maintaining high accuracy. To this end, we design a cost-sensitive algorithm to learn action models under partial observability. It combines three techniques to reduce the total cost: constraints, filtering, and active learning. These techniques are used to reduce the observations needed in action model learning. First, the algorithm uses constraints to confine the observation space. Second, it removes unnecessary observations by belief state filtering. Third, it actively selects observations based on the results of the previous two techniques. This paper also designs strategies to reduce the number of plan steps used in learning. Experiments on several benchmark domains show two results. First, the learning accuracy is high in most cases. Second, the algorithm dramatically reduces the total cost, as cost is defined in this paper. The approach is therefore significant for real-world learners, especially when long plans are unavailable or observations are expensive.
This paper presents a "Global Learning Method" (G.L.M.) that can be used to construct a human-friendly interface for man-machine systems. The aim of the G.L.M. is to identify, in a user-friendly manner, the user's requirements of the environment by means of a fuzzy set, in order to give the machine adequate instructions for creating the environment the user desires. First, man-machine systems are considered from the viewpoint of a human-friendly interface. Second, the G.L.M. is proposed. Third, the method is applied to the practical problem of identifying the user's requirements in the color environment of the Graphical User Interface (GUI) of personal computers, and the results are shown. Finally, the effectiveness of introducing the G.L.M. into the construction of a human-friendly interface is discussed from the viewpoint of man-machine systems.
To improve the accuracy and precision of online learning-based collision detection methods, an online active ensemble learning method for robot collision detection (OAELRCD) is proposed in this paper. OAELRCD consists of two key components: (1) an ensemble learning method that combines several base classifiers to improve the accuracy and precision of collision detection, and (2) an active learning algorithm that reduces the number of training samples so that training and learning can be performed online when the environment changes. We evaluate the proposed OAELRCD on one robot arm in dynamic environments with moving workspace obstacles, showing that it outperforms the state-of-the-art online learning-based method and geometric collision checkers. Compared to the state-of-the-art online learning-based method for robot collision detection in dynamic environments, OAELRCD provides noticeable improvements in TPR, AUC, accuracy, and TNR. Compared to state-of-the-art geometric collision checkers, collision checks with OAELRCD are faster.