One of the greatest challenges in data mining is the processing and analysis of massive data streams. In contrast to traditional static data mining problems, data streams require that each element be processed only once, that the amount of allocated memory be constant, and that the models incorporate changes in the investigated streams. The vast majority of available methods have been developed for data stream classification, and only a few attempt to solve regression problems, using various heuristic approaches. In this paper, we develop mathematically justified regression models working in a time-varying environment. More specifically, we study incremental versions of generalized regression neural networks, called IGRNNs, and we prove their tracking properties, namely weak (in probability) and strong (with probability one) convergence, assuming various concept drift scenarios. First, we present the IGRNNs, based on the Parzen kernels, for modeling stationary systems under nonstationary noise. Next, we extend our approach to modeling time-varying systems under nonstationary noise. We present several types of concept drift that our approach handles in such a way that weak and strong convergence holds under certain conditions. Then, in a series of simulations, we compare our method with commonly used heuristic approaches to concept drift, based on forgetting mechanisms or sliding windows. Finally, we apply our approach to a real-life scenario: predicting currency exchange rates.
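As a rough illustration of the single-pass, constant-memory constraint described in this abstract, the sketch below implements a generic recursive kernel regression estimator (Nadaraya-Watson type) evaluated on a fixed grid of query points. It is not the paper's IGRNN procedure: the Gaussian kernel, the fixed bandwidth, the learning rate n^(-gamma), and all class and parameter names are assumptions made for this example.

```python
import math


class IncrementalKernelRegressor:
    """Recursive Nadaraya-Watson-style estimator on a fixed grid (illustrative)."""

    def __init__(self, grid, bandwidth=0.3, gamma=0.7):
        self.grid = list(grid)        # query points where the estimate is kept
        self.bandwidth = bandwidth    # assumed fixed kernel width
        self.gamma = gamma            # assumed learning-rate exponent
        self.n = 0
        self.num = [0.0] * len(self.grid)  # running estimate of E[Y K(x - X)]
        self.den = [0.0] * len(self.grid)  # running estimate of E[K(x - X)]

    @staticmethod
    def _kernel(u):
        # Gaussian kernel (an assumption; other Parzen kernels could be used).
        return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

    def update(self, x, y):
        """Process one stream element (x, y) exactly once."""
        self.n += 1
        rate = self.n ** (-self.gamma)
        for i, g in enumerate(self.grid):
            k = self._kernel((g - x) / self.bandwidth) / self.bandwidth
            self.num[i] += rate * (y * k - self.num[i])
            self.den[i] += rate * (k - self.den[i])

    def predict(self):
        """Return the current regression estimate at every grid point."""
        return [n / d if d > 1e-12 else 0.0 for n, d in zip(self.num, self.den)]
```

Memory use depends only on the grid size, not on the stream length. A learning rate that decays more slowly than 1/n gives recent samples more weight, which is one common way to let such an estimator follow a changing target.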
Every second, a huge volume of multi-dimensional data is generated in fields such as social networking, the Industrial Internet of Things, the stock market and e-commerce applications. Knowledge and pattern extraction is a challenging task given the evolving nature of data streams. The major issues are (i) ‘concept drift’, which occurs as a result of pattern changes in the data distribution, and (ii) ‘concept evolution’, which occurs when a new class emerges in the data stream. These issues degrade the performance of learning models. In this paper, we focus on detecting concept evolution and on enhancing the performance of classifiers. For this, we propose a new model to identify novel classes, namely, Detection of Novel Classes (DNC). The proposed method adopts long short-term memory to continuously observe the streaming data in order to detect emerging classes. The continuous monitoring allows the model to distinguish between existing classes and novel classes, which saves time and memory. The proposed method is also shown to identify more than one novel class. The experiments are performed over seven different datasets. The results confirm that the proposed method improves efficiency by 6% to 34% over existing methods in the literature when identifying new concepts in the evolving data stream.
A software system evolves as changes are made to accommodate new features and repair defects. Software components are frequently interdependent, so changes made to one component can result in changes having to be made to other components to ensure that the system remains consistent; this is called change propagation. Accurate detection of change propagation is essential for software maintenance, which can be aided by accurate prediction of change propagation. In this paper, we study change propagation in three leading open-source software products: Linux, FreeBSD, and Apache HTTP Server. We use association rules-based data-mining techniques to detect change-propagation rules from the product version history. These rules are evaluated with respect to different training data sets and different test data sets. We discuss the applicability of using association-rule mining for change propagation, and several related issues. We find that a challenging issue in association-rule mining, concept drift, exists in software systems. Concept drift complicates the task of change-propagation prediction and requires special approaches, different from currently-used techniques for predicting change propagation.
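To make the idea of change-propagation rules concrete, here is a minimal sketch that mines pairwise association rules of the form "when file A changes, file B also changes" from a list of commits. The function name, the thresholds, and the restriction to pairwise rules are assumptions for illustration; the study itself may use a different mining algorithm and rule form.

```python
from collections import Counter
from itertools import permutations


def mine_pair_rules(commits, min_support=2, min_confidence=0.6):
    """Return {(a, b): (support, confidence)} for rules 'a changed -> b changed'.

    commits: iterable of collections of file names changed together.
    support: number of commits in which both a and b changed.
    confidence: support divided by the number of commits in which a changed.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for files in commits:
        unique = set(files)
        item_counts.update(unique)
        # Ordered pairs, so that a -> b and b -> a are scored separately.
        pair_counts.update(permutations(sorted(unique), 2))

    rules = {}
    for (a, b), together in pair_counts.items():
        confidence = together / item_counts[a]
        if together >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (together, confidence)
    return rules
```

Concept drift in this setting would show up as rules mined from an older slice of the version history losing confidence on newer commits, which is why the choice of training window matters.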
Data stream learning in non-stationary environments with skewed class distributions has been receiving increasing attention in the machine learning community. This paper proposes a novel ensemble classification method (ECSDS) for classifying data streams with skewed class distributions. In the proposed ensemble method, a back-propagation neural network is selected as the base classifier. To demonstrate the effectiveness of the proposed method, we choose three baseline methods based on ECSDS and evaluate their overall performance on ten datasets from the UCI Machine Learning Repository. The performance of incremental learning is also evaluated on these datasets. The experimental results show that the proposed method can effectively deal with classification problems on non-stationary data streams with class imbalance.
Massively distributed data mining in large networks such as smart device platforms and peer-to-peer systems is a rapidly developing research area. One important problem here is concept drift, where global data patterns (movement, preferences, activities, etc.) change according to the actual set of participating users, the weather, the time of day, or as a result of events such as accidents or even natural catastrophes. In an important case, when the network is very large but only a few training samples can be obtained locally at each node, no distributed solution is known that can follow concept drift efficiently. This case is characteristic of smart device platforms where each device stores only one local observation or data record related to a learning problem. Here we present two algorithms to handle concept drift. Neither algorithm collects data at a central location; instead, models of the data perform random walks in the network while being improved using an online learning algorithm. The first algorithm achieves adaptivity by maintaining young as well as old models in the network according to a fixed age distribution. The second one measures the performance of models locally, and discards them if they are judged outdated. We demonstrate through a thorough experimental analysis that our algorithms outperform the known competing methods if the number of independent local samples is limited relative to the speed of drift: a typical scenario in our targeted application domains. The two algorithms have different strengths: while the age distribution approach is very simple and efficient, explicit drift detection can be useful in monitoring applications to trigger control action.
In ubiquitous data stream mining, different devices often aim to learn concepts that are similar to some extent. In many applications, such as spam filtering or news recommendation, the concept underlying the data stream (e.g., interesting mail/news) is likely to change over time. Therefore, the resulting model must be continuously adapted to such changes. This paper presents a novel Collaborative Data Stream Mining (Coll-Stream) approach that exploits the similarities in the knowledge available from other devices to improve local classification accuracy. Coll-Stream integrates the community knowledge using an ensemble method where the classifiers are selected and weighted based on their local accuracy for different partitions of the feature space. We evaluate Coll-Stream classification accuracy under varying concept drift, noise, partition granularity and concept similarity in relation to the local underlying concept. The experimental results show that the resulting Coll-Stream model achieves stability and accuracy in a variety of situations using both synthetic and real-world datasets.
Mining non-stationary data streams is a challenging and important task. It is used in financial sectors, web log analysis, sensor networks, network traffic management, etc. In this environment, the data distribution may change over time, which is called concept drift. It is therefore necessary to identify the changes and address them to keep the model relevant to the incoming data. Many researchers have used the Drift Detection Method (DDM). However, DDM performs poorly on gradual drift, where its detection delay is high. In this paper, we propose the Adaptive Drift Detection Method (ADDM), which improves the performance of the drift detection mechanism. ADDM uses a new parameter to detect gradual drift in order to reduce the detection delay. ADDM is evaluated on six synthetic datasets and four real-world datasets. Experimental results confirm that ADDM reduces the drift detection delay and false-positive rate (FPR) while preserving high classification accuracy.
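For readers unfamiliar with the baseline, the sketch below shows the standard DDM rule: it tracks the classifier's running error rate p and its standard deviation s, remembers the point where p + s was lowest, and signals a warning or a drift when p + s rises two or three minimum standard deviations above that point. This is the baseline DDM only, not ADDM, whose additional parameter is not described in the abstract. The class and method names are assumptions, and a strict comparison is used so that an error-free stream (s = 0) does not trigger a false alarm.

```python
import math


class DDM:
    """Baseline Drift Detection Method over a stream of 0/1 prediction errors."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0            # running error rate
        self.p_min = math.inf   # error rate at the lowest p + s seen so far
        self.s_min = math.inf   # standard deviation at that point

    def update(self, error):
        """error is 1 for a misclassification, 0 otherwise.

        Returns one of 'stable', 'warning', 'drift'.
        """
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(max(0.0, self.p * (1.0 - self.p)) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + s <= self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()  # start fresh statistics for the new concept
            return "drift"
        if self.p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"
```

Because p is averaged over every sample since the last reset, a slow increase in the error rate moves p + s only gradually, which is consistent with the long detection delay on gradual drift noted above.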
Concept drift detection algorithms have historically been faithful to the aged architecture of forcefully resetting the base classifiers for each detected drift. This approach prevents underlying classifiers from becoming outdated as the distribution of a data stream shifts from one concept to another. In situations where both concept drift and temporal dependence are present within a data stream, forced resetting can cause complications in classifier evaluation. Resetting the base classifier too frequently when temporal dependence is present can make classifier performance appear successful when in fact this is misleading. In this research, a novel architectural method for determining base classifier resets, Burst Detection-Based Selective Classifier Resetting (BD-SCR), is presented. BD-SCR statistically monitors changes in the temporal dependence of a data stream to determine if a base classifier should be reset for detected drifts. The experiments compare the predictive performance of state-of-the-art drift detectors against the “No-Change” detector, using BD-SCR to inform and control the resetting decision. Results show that BD-SCR effectively reduces the negative impact of temporal dependence during concept drift detection, as shown by a clear drop in the performance of the “No-Change” detector, while maintaining the predictive performance of state-of-the-art drift detection methods.
This work describes the use of a weighted ensemble of neural network classifiers for adaptive learning. We train the neural networks by means of a quantum-inspired evolutionary algorithm (QIEA). The QIEA is also used to determine the best weights for each classifier belonging to the ensemble when a new block of data arrives. After running several simulations using two different datasets and performing two different analyses of the results, we show that the proposed algorithm, named neuro-evolutionary ensemble (NEVE), was able to learn the data set and to quickly respond to any drifts in the underlying data, indicating that our model can be a good alternative for addressing concept drift problems. We also compare the results obtained by our model with an existing algorithm, Learn++.NSE, in two different nonstationary scenarios.
The spread of real-time applications has led to a huge amount of data shared between users. This vast volume of data, rapidly evolving over time, is referred to as a data stream. Clustering and processing such data poses many challenges to the data mining community. Indeed, traditional data mining techniques become infeasible for mining such a continuous flow of data, where characteristics, features, and concepts change rapidly over time. This paper presents a novel method for data stream clustering. In this context, major challenges of data stream processing are addressed, namely, infinite length, concept drift, novelty detection, and feature evolution. To handle these issues, the proposed method uses the Artificial Immune System (AIS) meta-heuristic. The latter has been widely used for data mining tasks and has the adaptability required by data stream clustering algorithms. Our method, called AIS-Clus, is able to detect novel concepts using the performance of the learning process of the AIS meta-heuristic. Furthermore, AIS-Clus has the ability to adapt its model to handle concept drift and feature evolution for textual data streams. Experiments were performed on textual datasets, with efficient and promising results.
This study proposes two approaches for dynamic financial distress prediction (FDP) based on class-imbalanced data batches by considering both concept drift and class imbalance. One is based on a sliding time window and the synthetic minority over-sampling technique (SMOTE), and the other is based on a sliding time window and majority class partition. Support vector machine, multiple discriminant analysis (MDA) and logistic regression are used as base classifiers in the experiments on a real-world dataset. The results indicate that the two approaches perform better than both the pure dynamic FDP (DFDP) models without class imbalance processing and the static FDP models with or without class imbalance processing.
Streaming data mining is in use today in many industrial applications, but model performance deteriorates under concept drift, especially when true labels are unavailable. This paper addresses the need to detect concept drift in unsupervised settings and proposes the Unsupervised Concept Drift Detection (UCDD) method. A clustering technique is first applied to determine artificial labels for the data set, then a fast drift detection algorithm is used to detect changes in the boundary between the labeled clusters. In the empirical evaluation, the method proves effective in detecting various types of concept drift.
Learning from multiple data streams has attracted considerable attention recently. However, different feature spaces and uncertain concept drift situations across streams may degrade the performance of machine learning models. To address this issue, this chapter proposes an adaptive stacking method for multiple data stream learning. First, a stacking-based learning framework is built to handle multiple data streams with different feature spaces. Second, a selective retraining scheme is developed for concept drift adaptation. Finally, the proposed method is tested on five data scenarios and compared with three benchmarks, and the experimental results demonstrate its efficiency.
Concept drift refers to a change in the distribution of data streams. The correlations between data streams can also change over time. A machine learning platform called ML4MDS is proposed for learning from multiple data streams. It provides ways to evaluate algorithms designed for multi-stream problems and to visualize the results and the correlations between data streams dynamically. It is flexible and can be extended with new datasets and new algorithms. The API’s design is inspired by scikit-multiflow, and the browser-based graphical user interface is based on streamlit. It is released under the BSD 3-Clause License. The source code is available at https://www.github.com/ml4mds/ml4mds.
In real-world applications, concept drift and missing labels in data streams make classification considerably harder. In contrast to existing approaches, we propose a new ensemble classification approach based on concept drift detection and model selection for data streams with unlabeled data. First, we build an ensemble model based on classifiers and clusters. Second, we adopt a new concept drift detection method based on the divergence of concept distributions between two adjoining data chunks to identify concept drifts. In the selection of the base model, we use the time-stamp-based weight and the divergence between concept distributions simultaneously. Finally, extensive experiments show that our approach can quickly adapt to data streams with concept drift and improves classification accuracy compared to several state-of-the-art classification algorithms.
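The chunk-to-chunk comparison mentioned in this abstract can be illustrated with a small sketch. The abstract does not say which divergence or which representation of the concept distribution is used, so everything here is an assumption: a single numeric feature, equal-width histograms, the Jensen-Shannon divergence (base 2, so values lie between 0 and 1), a fixed threshold, and all function names. Chunks are assumed to be non-empty.

```python
import math


def histogram(values, lo, hi, bins):
    """Normalised equal-width histogram of a non-empty list of numbers."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp out-of-range values into the first or last bin.
        idx = min(bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def chunk_drifted(prev_chunk, next_chunk, lo=0.0, hi=1.0, bins=10, threshold=0.1):
    """Flag drift when two adjoining chunks differ by more than the threshold."""
    p = histogram(prev_chunk, lo, hi, bins)
    q = histogram(next_chunk, lo, hi, bins)
    return js_divergence(p, q) > threshold
```

A symmetric, bounded divergence is convenient here because a single threshold can be applied regardless of which chunk came first, and empty histogram bins do not produce infinite values.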