The paper describes an integrated recognition-by-parts architecture for reliable and robust face recognition. Reliability refers to the ability to deploy full-fledged and operational biometric engines, while robustness refers to handling adverse image conditions that include, among others, uncooperative subjects, occlusion, and temporal variability. The architecture proposed is model-free and non-parametric. The conceptual framework draws support from discriminative methods using likelihood ratios. At the conceptual level it links forensics and biometrics, while at the implementation level it links the Bayesian framework and statistical learning theory (SLT). Layered categorization starts with face detection using implicit rather than explicit segmentation. It proceeds with face authentication, which involves feature selection of local patch instances including dimensionality reduction, exemplar-based clustering of patches into parts, and data fusion for matching using boosting driven by parts that play the role of weak learners. Face authentication shares the same implementation with face detection. The implementation, driven by transduction, employs proximity and typicality (ranking) realized using strangeness and p-values, respectively. The feasibility and reliability of the proposed architecture are illustrated using FRGC data. The paper concludes with suggestions for augmenting and enhancing the scope and utility of the proposed architecture.
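The strangeness and p-value machinery mentioned above follows the usual transductive construction. The sketch below, in Python, assumes Euclidean distance on patch features and the standard ratio-of-distances strangeness measure; the paper's exact implementation may differ in detail.

```python
import numpy as np

def strangeness(x, y, X, Y, k=3):
    """Strangeness of sample x under tentative label y: the ratio of the
    summed distances to its k nearest same-label samples over the summed
    distances to its k nearest other-label samples (smaller = more typical)."""
    d = np.linalg.norm(X - x, axis=1)
    same = np.sort(d[Y == y])[:k]
    other = np.sort(d[Y != y])[:k]
    return same.sum() / (other.sum() + 1e-12)

def p_value(alpha_new, alphas):
    """Typicality (p-value): the fraction of previously observed strangeness
    values that are at least as large as the candidate's."""
    alphas = np.asarray(alphas)
    return (np.sum(alphas >= alpha_new) + 1) / (len(alphas) + 1)

# A test patch is assigned the label with the largest p-value; uniformly low
# p-values flag the patch as atypical (e.g., occluded or from an impostor).
```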
The accelerating progress and availability of low-cost computers, high-speed networks, and software for high-performance distributed computing allow us to reconsider computationally expensive techniques in image processing and pattern recognition. We propose a two-level hierarchical k-nearest neighbor classifier in which the first level uses graphics processing units (GPUs) and the second level uses a high-performance cluster (HPC). The system is evaluated on the problem of character recognition with nine databases (Arabic digits, Indian digits (Bangla, Devnagari, and Oriya), Bangla characters, Indonesian characters, Arabic characters, Farsi characters and digits). Contrary to many approaches that tune the model for different scripts, the proposed image classification method is unchanged throughout the evaluation on the nine databases. We show that a hierarchical combination of decisions based on two distances, using GPUs and an HPC, provides state-of-the-art performance on several scripts and better accuracy than more complex systems.
Analysis and classification of landscapes based on remote sensing imagery is a popular research topic. In this paper, we propose a new remote sensing data classifier that incorporates support vector machine (SVM) learning information into the K-nearest neighbor (KNN) classifier. The SVM is well known for its extraordinary generalization capability even with limited learning samples, which makes it very useful for remote sensing applications where data samples are usually limited. The KNN has been widely used in data classification due to its simplicity and effectiveness. However, the KNN is instance-based and needs to keep all the training samples for classification, which can cause not only high computational complexity but also overfitting problems. Moreover, the performance of the KNN classifier is sensitive to the neighborhood size K, and selecting the value of K relies heavily on practice and experience. Based on the observation that the SVM can help the KNN with both small training sample sizes and the selection of K, we propose a support vector nearest neighbor (SV-NN) hybrid classification approach that simplifies parameter selection while maintaining classification accuracy. The proposed approach consists of two stages. In the first stage, the SVM is applied to the training samples to obtain a reduced set of support vectors (SVs) for each sample category. In the second stage, a nearest neighbor classifier (NNC) is used to classify a testing sample, i.e. the average Euclidean distance between the testing data point and each category's set of SVs is calculated, and the NNC assigns the category with the minimum distance. To evaluate the effectiveness of the proposed approach, we first conduct classification experiments on samples from remote sensing data and then experiments on identifying different land-cover regions in remote sensing images. Experimental results show that the SV-NN approach maintains good classification accuracy while reducing the number of training samples compared with the conventional SVM and KNN classification models.
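The two-stage SV-NN procedure described above can be sketched directly with standard tools. The snippet below is a minimal illustration, assuming a linear-kernel SVM and plain Euclidean distances; function names and SVM settings are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.svm import SVC

def svnn_fit(X, y):
    """Stage 1: fit an SVM and keep only its support vectors, grouped by class."""
    X, y = np.asarray(X), np.asarray(y)
    svm = SVC(kernel="linear").fit(X, y)
    sv_X, sv_y = svm.support_vectors_, y[svm.support_]
    return {c: sv_X[sv_y == c] for c in np.unique(sv_y)}

def svnn_predict(svs_by_class, X_test):
    """Stage 2: assign each test point to the class whose support vectors
    have the smallest average Euclidean distance to it."""
    preds = []
    for x in np.asarray(X_test):
        avg_dist = {c: np.linalg.norm(svs - x, axis=1).mean()
                    for c, svs in svs_by_class.items()}
        preds.append(min(avg_dist, key=avg_dist.get))
    return np.array(preds)
```

Because only the support vectors are retained, the reference set seen by the nearest neighbor stage is much smaller than the full training set, which is where the reported reduction in training samples comes from.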
In remote sensing image classification, distance measures and classification criteria are equally important; inaccuracy in either degrades classification accuracy. Remote sensing image classification was performed by combining the support vector machine (SVM) and k-nearest neighbor (KNN), based on the separability of classes under the SVM and the spatial and spectral characteristics of remote sensing data. Moreover, a distance formula is proposed as the measure criterion that considers both the luminance and the direction of the vectors. First, the SVM is trained and the support vectors (SVs) are obtained for each class. In the testing phase, new test samples are entered, and the average distance between each test sample and the SVs of each class is calculated using the proposed distance formula. Finally, the test sample is assigned to the class with the minimal average distance. This procedure is repeated until all test samples are classified. In the combinatorial algorithm, the nearest neighbor classifier is used to classify a testing sample, i.e. the average Euclidean distance between the testing data point and each set of SVs from different categories is calculated, and the nearest neighbor classifier identifies the category with the minimum distance. The proposed combinatorial algorithm has advantages over the conventional KNN in eliminating the k parameter selection problem and reducing the heavy learning time. A comparison is carried out with SVM, KNN, and Spectral Angle Mapper (SAM) classification using ALOS/PALSAR and PSM images, and the effectiveness of the proposed algorithm is demonstrated.
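The abstract does not give the proposed luminance-and-direction distance explicitly. The sketch below is one plausible form, assuming a weighted mix of Euclidean distance (magnitude, i.e. luminance) and spectral angle (direction); the weight and the combination rule are assumptions, not the paper's formula.

```python
import numpy as np

def luminance_direction_distance(a, b, w=0.5):
    """Hypothetical combined measure: a weighted mix of the Euclidean distance
    between two spectral vectors and the angle between them. Both w and the
    additive combination are assumptions; the paper's formula may differ."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    euclid = np.linalg.norm(a - b)
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    angle = np.arccos(np.clip(cos_sim, -1.0, 1.0))  # spectral angle in radians
    return w * euclid + (1.0 - w) * angle
```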
The aim of this study is to test the freshness of horse mackerels using a low-cost electronic nose system composed of eight different metal oxide sensors. The freshness evaluation covers a scale of seven different classes corresponding to 1, 3, 5, 7, 9, 11, and 13 storage days. These seven classes are categorized by six different classifiers arranged in the proposed binary decision tree structure. The classifier at each node of the tree is trained individually with the training dataset. To increase success in determining the level of fish freshness, one of the k-Nearest Neighbors (k-NN), Support Vector Machines (SVM), Linear Discriminant Analysis (LDA), and Bayes methods is selected for every classifier, and the feature space changes at every node. The significance of this study among others in the literature is that the proposed decision tree structure has never before been applied to determine fish freshness. Because the freshness of fish is observed under actual market storage conditions, the classification is more difficult. The results show that the electronic nose designed with the proposed decision tree structure is able to determine the freshness of horse mackerels with 85.71% accuracy on test data obtained one year after the training process. The performance of the proposed method is also compared against conventional methods such as Bayes, k-NN, and LDA.
Graph-Based Data Mining (GBDM) is an interesting research area that allows the user to mine significant information in the form of frequent subgraphs. One of the well-known algorithms developed to extract frequent patterns is the GASTON algorithm. Retrieving interesting webpages from log files contributes heavily to various applications. In this work, a webpage recommendation system is proposed by introducing the Chronological Cuckoo Search (Chronological-CS) algorithm and the Laplace correction based k-Nearest Neighbor (LKNN) to retrieve useful webpages from the interesting webpages. Initially, the W-Gaston algorithm extracts the interesting subgraphs from the log files and provides them to the proposed webpage recommendation system. The interesting subgraphs are then clustered with the proposed Chronological-CS algorithm, which is developed by integrating the chronological concept into the Cuckoo Search (CS) algorithm, yielding various cluster groups. Then, the proposed LKNN algorithm recommends webpages from the clusters. Simulation of the proposed webpage recommendation algorithm is carried out using data from the MSNBC and weblog databases. The results are compared with various existing webpage recommendation models and analyzed based on precision, recall, and F-measure. The proposed webpage recommendation model achieved better performance than the existing models, with values of 0.9194, 0.8947, and 0.86736 for precision, recall, and F-measure, respectively.
Gatherings of thousands to millions of people frequently occur for an enormous variety of educational, social, sporting, and political events, and automated counting of these high-density crowds is useful for safety, management, and measuring the significance of an event. In this work, we show that the commonly accepted labeling scheme of crowd density maps for training deep neural networks may not be the most effective one. We propose an alternative inverse k-nearest neighbor (ikNN) map mechanism that, even when used directly in existing state-of-the-art network structures, shows superior performance. We also provide new network architecture mechanisms, demonstrated in our own MUD-ikNN network, which uses multi-scale drop-in replacement upsampling via transposed convolutions to take full advantage of the provided ikNN labeling. This upsampling combined with the ikNN maps further improves crowd counting accuracy. We further analyze several variations of the ikNN labeling mechanism, which apply transformations on the kNN measure before generating the map, in order to consider the impact of camera perspective views, image resolutions, and the changing rates of the mapping functions. To alleviate the effects of crowd density changes in each image, we also introduce an attenuation mechanism in the ikNN mapping. Experimentally, we show that the inverse square root kNN map variation (iRkNN) provides the best performance. Discussions are provided on computational complexity, label resolutions, the gains in mapping and upsampling, and details of critical cases such as various crowd counts, uneven crowd densities, and crowd occlusions.
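To make the ikNN labeling concrete, the sketch below builds such a map for one image: every pixel records an inverse function of its distance to the k-th nearest head annotation. The mapping 1/(d^p + 1) and the `power` knob (0.5 for an inverse-square-root style variant) are assumptions standing in for the paper's exact mapping, perspective handling, and attenuation.

```python
import numpy as np
from scipy.spatial import cKDTree

def iknn_map(head_points, shape, k=1, power=1.0):
    """Inverse kNN label map sketch: for each pixel, distance to the k-th
    nearest head annotation passed through an inverse mapping."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    pixels = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    tree = cKDTree(head_points)             # head_points: (N, 2) array of (x, y)
    d = tree.query(pixels, k=k)[0]
    d_k = d if k == 1 else d[:, -1]         # distance to the k-th neighbor
    return (1.0 / (d_k ** power + 1.0)).reshape(h, w)
```

Unlike a Gaussian-smoothed density map, this label is non-zero far from annotations, which is the property the abstract credits for the improved training signal.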
Traditional classifiers are trapped by the class-imbalance problem because they are biased toward the majority class. Oversampling methods can improve imbalanced classification by creating synthetic minority class samples, but noise generation has been a great challenge for them. Filtering-based and direction-change methods have been proposed against noise generation. Yet the noise filters adopted in filtering-based methods are biased toward the majority class. Besides, the k-nearest neighbor (KNN)-based interpolation in filtering-based and direction-change methods is susceptible to abnormal samples (e.g. outliers, noise, or unsafe borderline samples). To overcome noise generation while addressing the above shortcomings of filtering-based and direction-change methods, this work presents a new synthetic minority oversampling technique based on the local mean-based KNN (SMOTE-LMKNN). In SMOTE-LMKNN, the local mean-based KNN (LMKNN) is first introduced to describe the local characteristics of imbalanced data. Second, a new LMKNN-based noise filter is proposed to remove noise and unsafe borderline samples. Third, interpolation between a base sample and its LMKNN is proposed to create synthetic minority class samples. Empirical results of extensive experiments on 18 data sets show that SMOTE-LMKNN is competitive with seven popular oversampling methods when training a KNN classifier and a classification and regression tree (CART).
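The interpolation step is the part that differs most visibly from classic SMOTE: synthetic points are drawn toward the local mean of a sample's minority-class neighbors rather than toward a single random neighbor. The sketch below illustrates only that step, under the assumption of Euclidean neighbors; the paper's LMKNN-based noise filter is omitted.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_lmknn_oversample(X_min, k=5, n_new=100, rng=None):
    """Generate synthetic minority samples on the segment between a base
    sample and the local mean of its k nearest minority neighbors (LMKNN)."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the sample itself
    local_means = X_min[idx[:, 1:]].mean(axis=1)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        lam = rng.random()
        synthetic.append(X_min[i] + lam * (local_means[i] - X_min[i]))
    return np.array(synthetic)
```

Averaging over several neighbors dampens the pull of any single outlier or noisy borderline point, which is the intuition behind using the local mean as the interpolation target.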
The production of nitrogen oxides (NOx) in coal-fired boiler combustion is a significant source of environmental pollution. Flue gas denitrification is a standard NOx control technology for small- and medium-sized coal-fired boilers. Achieving steady-state control in flue gas denitrification can be challenging because coal-fired boiler systems are complex and exhibit significant delay. This study proposes a model based on a learning-based K-nearest neighbor (KNN) query mechanism for soft prediction of NOx output. First, a knowledge base is established in the proposed model through spatial division in accordance with previous combustion parameters; the clusters are formed based on the output NOx values, and the domain of values of the combustion parameters for each cluster is obtained. Second, the optimal cluster is selected using the knowledge base for an input vector q with new combustion parameters (q1, q2, …, qn). Lastly, the K tuples in the cluster that are closest to the input vector q are used to predict the output NOx value of q. The predicted NOx value can serve as a feedforward signal to control the output of the reductant for accurate denitrification. The experimental results reveal that the proposed practical model, capable of producing a prediction in sub-second time, is highly competitive with existing techniques. Furthermore, a deep learning algorithm (DLA) is also designed, but it underperforms the KNN model.
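The online query step can be sketched as below. The knowledge base is assumed to be a list of (parameters, NOx) clusters built offline; selecting the "optimal" cluster here uses the nearest parameter centroid as an approximation of the paper's domain-of-values rule, so this is an illustration rather than the authors' algorithm.

```python
import numpy as np

def predict_nox(q, clusters, K=5):
    """Predict NOx for query q: pick the cluster whose parameter centroid is
    closest to q, then average the NOx values of the K historical tuples
    nearest to q inside that cluster.

    clusters: list of (params, nox) pairs, params an (n, d) array, nox an (n,) array.
    """
    q = np.asarray(q, float)
    params, nox = min(clusters,
                      key=lambda c: np.linalg.norm(c[0].mean(axis=0) - q))
    nearest = np.argsort(np.linalg.norm(params - q, axis=1))[:K]
    return nox[nearest].mean()
```

Restricting the K-nearest search to a single pre-built cluster is what keeps the query time in the sub-second range reported above.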
The response to a natural disaster ultimately depends on credible and real-time information regarding impacted people and areas. Nowadays, social media platforms such as Twitter have emerged as the primary and fastest means of disseminating information. Due to the massive, imprecise, and redundant information on Twitter, efficient automatic sentiment analysis (SA) plays a crucial role in enhancing disaster response. This paper proposes a novel methodology to efficiently perform SA of Twitter data during a natural disaster. Tweets during a natural calamity are biased toward negative polarity, producing imbalanced data. The proposed methodology reduces the misclassification of minority class samples through the adaptive synthetic sampling technique. A binary modified equilibrium optimizer is used to remove irrelevant and redundant features, and the k-nearest neighbor classifier with an optimized value of k is used for sentiment classification. Nine datasets on natural disasters have been used for evaluation. The performance of the proposed methodology has been validated using the Friedman mean rank test against nine state-of-the-art techniques, including two optimized, one transfer learning, one deep learning, two ensemble learning, and three baseline classifiers. The results show the significance of the proposed methodology through average improvements of 6.9%, 13.3%, 20.2%, and 18% in accuracy, precision, recall, and F1-score, respectively, compared to the nine state-of-the-art techniques.
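A compressed sketch of the pipeline is given below. ADASYN (adaptive synthetic sampling) and a cross-validated search for k use standard library components; the binary modified equilibrium optimizer is not reproduced and is replaced by an externally supplied boolean feature mask, so treat this only as an outline of the data flow.

```python
import numpy as np
from imblearn.over_sampling import ADASYN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

def classify_disaster_tweets(X_train, y_train, X_test, feature_mask):
    """Pipeline sketch: balance the negatively skewed classes with ADASYN,
    keep only the features selected by a binary mask (stand-in for the
    modified equilibrium optimizer), and tune k for kNN by grid search."""
    X_bal, y_bal = ADASYN().fit_resample(X_train[:, feature_mask], y_train)
    search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": list(range(1, 16, 2))}, cv=5)
    search.fit(X_bal, y_bal)
    return search.predict(X_test[:, feature_mask])
```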
Every second, a huge volume of multi-dimensional data is generated in fields such as social networking, the Industrial Internet of Things, stock markets, and e-commerce applications. Knowledge and pattern extraction is a challenging task given the evolving nature of data streams. The major issues are (i) 'concept drift', which occurs as a result of pattern changes in the data distribution, and (ii) 'concept evolution', which occurs when a new class evolves in the data stream. These issues degrade the performance of learning models. In this paper, we focus on the detection of concept evolution and on enhancing the performance of classifiers. To this end, we propose a new model to identify novel classes, namely Detection of Novel Classes (DNC). The proposed method adopts long short-term memory to continuously observe the streaming data in order to detect emerging classes. Continuous monitoring allows the model to distinguish between existing classes and novel classes, which saves time and memory. The proposed method is also demonstrated for identifying more than one novel class. Experiments are performed on seven different datasets. The results confirm that the proposed method improves efficiency by 6% to 34% in identifying new concepts in the evolving data stream compared with existing methods in the literature.
In remote sensing, the intensities from a multispectral image are used in a classification scheme to distinguish different types of ground cover from each other. An example is given where different soil types are classified. A digitized complete scene from a satellite sensor consists of a large amount of data, and in future image sensors the resolution and the number of spectral bands will increase even further. Data-parallel computers are therefore well suited for these types of classification algorithms. This article focuses on three supervised classification algorithms: Maximum Likelihood, K-Nearest Neighbor, and Backpropagation, together with their parallel implementations. They are implemented on the Connection Machine/200 in the high-level language C*. The algorithms are finally tested and compared on an image registered over western Estonia.
In this paper, we analyze a model called the k-nearest neighbor queue with the possibility of delayed queue length feedback. We prove fluid limits for the stochastic queueing model and show that the fluid limit is governed by a system of delay differential equations. Using the properties of circulant matrices, we derive a closed-form expression for the critical delay, which determines whether the delayed information induces oscillations or a Hopf bifurcation in our queueing system.
Logs play an important role in the maintenance of large-scale systems. The number of logs that indicate normal behavior (normal logs) differs greatly from the number of logs that indicate anomalies (abnormal logs), and the two types of logs have certain differences. Automatically detecting faults with the K-Nearest Neighbor (KNN) algorithm, an outlier detection method with high accuracy, is an effective way to detect anomalies from logs. However, logs are large in scale and have very uneven samples, which affects the results of the KNN algorithm on log-based anomaly detection. Thus, we propose an improved KNN algorithm-based method that uses the existing mean-shift clustering algorithm to efficiently select the training set from massive logs. We then assign different weights to samples at different distances, which reduces the negative effect of the unbalanced distribution of log samples on the accuracy of the KNN algorithm. Comparative experiments on log sets from five supercomputers show that the proposed method can be effectively applied to log-based anomaly detection, and that its accuracy, recall, and F-measure are higher than those of the traditional keyword search method.
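The two ingredients named above, mean-shift condensation of the training set and distance-based weighting, can be sketched with standard components. The version below assumes mean-shift is run per class so that each retained cluster center keeps its class label, and uses inverse-distance weighting as the weighting rule; the paper's exact weighting scheme may differ.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.neighbors import KNeighborsClassifier

def build_reduced_knn(X, y, bandwidth=None, k=5):
    """Condense massive log feature vectors with per-class mean-shift, then
    train a distance-weighted kNN on the cluster centers so that farther
    (less similar) samples contribute less to the vote."""
    centers, labels = [], []
    for c in np.unique(y):
        ms = MeanShift(bandwidth=bandwidth).fit(X[y == c])
        centers.append(ms.cluster_centers_)
        labels.extend([c] * len(ms.cluster_centers_))
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    return knn.fit(np.vstack(centers), np.array(labels))
```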
Principal component analysis (PCA) is a popular linear feature extractor, widely used in signal processing, face recognition, etc. However, the axes of the lower-dimensional space, i.e., the principal components, are a set of new variables carrying no clear physical meaning. We therefore propose unsupervised feature selection algorithms based on eigenvector analysis to identify the critical original features for each principal component. The presented algorithms rely on the k-nearest neighbor rule to find the predominant row components, and eight new measures are proposed to compute the correlation between row components in the transformation matrix. Experiments are conducted on benchmark data sets and on facial image data sets for gender classification to show their superiority.
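As a rough illustration of eigenvector-based feature selection, the sketch below keeps the original variables with the largest absolute loadings on the leading principal components. This is only the simplest member of this family; the paper's kNN rule over row components and its eight correlation measures are not reproduced here.

```python
import numpy as np
from sklearn.decomposition import PCA

def select_features_by_loadings(X, n_components=3, n_features=10):
    """Score each original feature by its largest absolute loading on the
    leading principal components and keep the top-scoring features."""
    pca = PCA(n_components=n_components).fit(X)
    scores = np.abs(pca.components_).max(axis=0)   # per-feature importance
    return np.argsort(scores)[::-1][:n_features]
```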
During the pandemic, the most significant reason for the deep concern about COVID-19 is that it spreads from individual to individual through contact or by staying close to an infected individual. COVID-19 has been recognized as a global pandemic, and a number of assessments are being performed using various numerical models. Machine Learning (ML) is commonly used in every field. Forecasting systems based on ML have shown their importance in interpreting perioperative effects to accelerate decision-making on the potential course of action. ML models have long been used to define and prioritize adverse threat variables in several technology domains. To manage forecasting challenges, many prediction approaches have been used extensively. The paper shows the ability of ML models to estimate the number of forthcoming COVID-19 victims, which is now considered a serious threat to civilization. It presents a comparative study of ML algorithms for predicting COVID-19, describes the data to be predicted, and analyses the attributes of COVID-19 cases in different places. It gives an initial benchmark to exhibit the capability of ML models for future examination.
Liver cancer is one of the most common fatal diseases worldwide, and its early detection through medical imaging is a major contributor to reducing mortality from this cancer. This paves the way to work on diagnosing liver diseases effectively. An accurate diagnosis of liver disease in CT images requires an efficient description of textures and effective classification methods. This paper performs a comparative analysis of the proposed texture feature descriptor against different existing texture features with various classifiers to classify six types of diffused and focal liver diseases. The classification of liver diseases is done in two stages. In the first stage, features such as segmentation-based fractal texture analysis, the counting label occurrence matrix, the local configuration pattern, the eXtended center-symmetric local binary pattern, and the proposed local symmetric tetra pattern are used for extracting information from the CT liver structure, and classifiers such as the support vector machine, k-nearest neighbor, and naive Bayes are used for classifying the pathologic liver. In the second stage, when pathologic conditions are detected, the best feature descriptors and classifiers are used to classify the results into one of six exclusive pathologic liver diseases. The experiments are carried out on medically validated liver datasets containing normal livers and six disease categories. The first experiment is analyzed using sensitivity, specificity, and accuracy; the second is evaluated using precision, recall, BCR, and F-measure. The results demonstrate that the local symmetric tetra pattern with the k-nearest neighbor classifier culminates in state-of-the-art performance for diagnosing liver diseases.
This paper describes a computer-based identification system for normal and alcoholic electroencephalography (EEG) signals. The identification system was constructed from feature extraction and classification algorithms. The feature extraction was based on wavelet packet decomposition (WPD) and energy measures. Feature fitness was established through the statistical t-test method. The extracted features were used as training and test data for a competitive 10-fold cross-validated analysis of six classification algorithms. This analysis showed that, with an accuracy of 95.8%, the k-nearest neighbor (k-NN) algorithm outperforms naïve Bayes classification (NBC), the fuzzy Sugeno classifier (FSC), the probabilistic neural network (PNN), the Gaussian mixture model (GMM), and the decision tree (DT). The 10-fold stratified cross-validation instilled reliability in the result; therefore, we are confident in stating that EEG signals can be used to automate both diagnosis and treatment monitoring of alcoholic patients. Such automation can lead to cost reduction by relieving medical experts from routine and administrative tasks.
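The WPD-plus-energy feature extraction referred to above can be illustrated with a standard wavelet library. The wavelet family (db4) and decomposition level in the sketch are assumptions chosen for illustration; the paper may use different settings.

```python
import numpy as np
import pywt

def wpd_energy_features(signal, wavelet="db4", level=4):
    """Decompose one EEG epoch into wavelet packet nodes at the given level
    and use the energy of each node's coefficients as one feature."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level, order="natural")
    return np.array([float(np.sum(np.square(node.data))) for node in nodes])
```

Each EEG epoch thus yields a fixed-length energy vector, which is what the t-test screening and the six classifiers operate on.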
Electroencephalography (EEG) is a measure that represents the functional activity of the brain. We show that a detailed analysis of EEG measurements provides highly discriminant features that indicate the mental state of patients with clinical depression. Our feature extraction method revolves around a novel processing structure that combines wavelet packet decomposition (WPD) and non-linear algorithms. WPD was used to select appropriate EEG frequency bands. The resulting signals were processed with the non-linear measures of approximate entropy (ApEn), sample entropy (SampEn), Rényi entropy (REN), and bispectral phase entropy (Ph). The features were selected using the t-test, and only discriminative features were fed to various classifiers, namely the probabilistic neural network (PNN), support vector machine (SVM), decision tree (DT), k-nearest neighbor algorithm (k-NN), naive Bayes classification (NBC), Gaussian mixture model (GMM), and fuzzy Sugeno classifier (FSC). Our classification results show that, with a classification accuracy of 99.5%, the PNN classifier performed better than the rest of the classifiers in discriminating between normal and depression EEG signals. Hence, the proposed decision support system can be used to diagnose and monitor the treatment of patients suffering from depression.
Data on the internet has been increasing every day, and automatic mining of essential information from an enormous amount of data has become a challenging task for an organisation with a huge dataset. In recent years, the prominent technology in the domain of Information Technology (IT) is big data, which deals with largely unstructured data beyond the processing capabilities of classical database systems. The data is fast and big and typically derived from multiple and independent sources. The three main challenges are data accessing, semantics, and domain knowledge for various big data utilisations, together with the complexities raised by big data volumes. One of the major challenges is the classification of big data. This paper surveys well-defined classification methodologies employed for big data classification. It reviews 50 research papers on big data classification methods, which are primarily categorised into six different categories, namely K-Nearest Neighbor (KNN), Support Vector Machine (SVM), fuzzy-based methods, Bayesian-based methods, Random Forest, and Decision Tree. In addition, a detailed analysis and discussion are carried out considering classification techniques, datasets utilised, evaluation metrics, semantic similarity measures, and publication year. Research gaps and issues in several traditional big data classification techniques are also explained to help investigators extend their work toward effective big data management.