With the rapid development of data collection methods and their practical applications, the management of uncertain data streams has drawn wide attention in both academia and industry. System capacity planning and quality-of-service (QoS) metrics are two important problems for data stream management systems (DSMSs), which must process streams efficiently despite unpredictable input characteristics and limited memory resources. Motivated by this, we explore an effective approach to estimating the memory requirement, data loss ratio, and tuple latency of continuous queries over sliding windows on uncertain data streams in a DSMS. Specifically, we propose a queueing model to address these problems. Under Poisson arrivals of the input streams, we derive the average number of tuples in the queue, the average tuple latency, and the distributions of both quantities. Furthermore, we determine the maximum capacity of the queueing system based on the data loss ratio. Solutions to these problems help researchers design, manage, and optimize a DSMS, including allocating the buffer needed for a queue and admitting a continuous uncertain query to the system without violating pre-specified QoS requirements.
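As a concrete illustration of this kind of analysis, the sketch below treats a query's input buffer as a textbook M/M/1/K queue, which is a simplifying assumption and not the paper's actual model: it computes the loss ratio, the average number of tuples, and the tuple latency via Little's law, then searches for the smallest capacity meeting a loss-ratio QoS target. The arrival and service rates are made-up values.

```python
# Hedged sketch: sizing a DSMS input queue as an M/M/1/K queue.
# lam, mu, and max_loss are assumed parameters, not values from the paper.

def mm1k_metrics(lam: float, mu: float, K: int):
    """Blocking probability, mean queue length, and mean latency
    for an M/M/1/K queue (requires rho != 1)."""
    rho = lam / mu
    p_block = (1 - rho) * rho**K / (1 - rho**(K + 1))           # tuple loss ratio
    L = rho / (1 - rho) - (K + 1) * rho**(K + 1) / (1 - rho**(K + 1))
    lam_eff = lam * (1 - p_block)                               # admitted rate
    W = L / lam_eff                                             # Little's law
    return p_block, L, W

def min_capacity(lam: float, mu: float, max_loss: float) -> int:
    """Smallest system capacity K whose loss ratio meets the QoS target."""
    K = 1
    while mm1k_metrics(lam, mu, K)[0] > max_loss:
        K += 1
    return K

if __name__ == "__main__":
    lam, mu = 800.0, 1000.0            # tuples/s arrival and service rates
    K = min_capacity(lam, mu, 1e-3)    # QoS: at most 0.1% tuple loss
    loss, L, W = mm1k_metrics(lam, mu, K)
    print(f"K={K}, loss={loss:.2e}, avg queue={L:.2f} tuples, latency={W*1e3:.2f} ms")
```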
Data stream associative classification poses many challenges to the data mining community. In this paper, we address four major ones: infinite length, knowledge extraction in a single scan, processing time, and accuracy. Since data streams are infinite in length, it is impractical to store and use all historical data for training; a streaming algorithm must scan the data once and extract knowledge as it goes, which makes mining such data both a unique opportunity and a difficult task. In this setting, processing time and accuracy become two key concerns. We propose PSTMiner, which accounts for the nature of data streams and provides an efficient classifier for predicting the class labels of real data streams, with greater potential than many existing classification techniques. Additionally, we propose a compact novel tree structure, PSTree (Prefix Streaming Tree), for storing the data. Extensive experiments on 24 real datasets from the UCI repository and synthetic datasets from MOA (Massive Online Analysis) show that PSTMiner is consistent, and that its accuracy and processing time are highly competitive with other approaches under the windowed streaming model.
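The abstract does not detail PSTree's internals, but the sketch below illustrates the generic idea a compact prefix tree relies on: records sharing a prefix share a path, so windowed data can be stored and counted compactly. All names here are hypothetical, not the paper's.

```python
# Illustrative sketch only: a generic prefix tree over stream records,
# standing in for the (unspecified) PSTree structure.

class Node:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}   # item -> Node
        self.count = 0       # number of records sharing this prefix

class PrefixStreamTree:
    """Compact prefix tree: records sharing a prefix share a path."""
    def __init__(self):
        self.root = Node()

    def insert(self, record):
        node = self.root
        for item in sorted(record):   # canonical order maximizes path sharing
            node = node.children.setdefault(item, Node())
            node.count += 1

    def prefix_count(self, prefix):
        node = self.root
        for item in sorted(prefix):
            if item not in node.children:
                return 0
            node = node.children[item]
        return node.count

tree = PrefixStreamTree()
for rec in [("a", "b", "c"), ("a", "b"), ("a", "c")]:
    tree.insert(rec)
print(tree.prefix_count(("a", "b")))   # -> 2
```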
Data stream learning in non-stationary environments with skewed class distributions has been receiving increasing attention in the machine learning community. This paper proposes a novel ensemble classification method (ECSDS) for classifying data streams with skewed class distributions, using a back-propagation neural network as the base classifier. To demonstrate the effectiveness of the proposed method, we evaluate it against three baseline variants of ECSDS on ten datasets from the UCI machine learning repository; we also evaluate its incremental-learning performance on the same datasets. The experimental results show that the proposed method can effectively handle classification problems on non-stationary data streams with class imbalance.
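ECSDS's exact member-weighting and update rules are not given in the abstract; the following sketch shows one common shape such a method can take, namely a chunk-based ensemble of back-propagation networks (scikit-learn's MLPClassifier) trained on naively rebalanced chunks and combined by majority vote. All design choices here are assumptions for illustration.

```python
# Generic sketch of a chunk-based ensemble for skewed streams; ECSDS's
# actual weighting and update rules are in the paper, not reproduced here.
import numpy as np
from sklearn.neural_network import MLPClassifier

def oversample(X, y, rng):
    """Naively rebalance a chunk by resampling minority-class rows."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), n_max, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

class ChunkEnsemble:
    def __init__(self, n_members=5, seed=0):
        self.members, self.n_members = [], n_members
        self.rng = np.random.default_rng(seed)

    def update(self, X_chunk, y_chunk):
        """Train one back-prop network per chunk; keep the newest members."""
        Xb, yb = oversample(X_chunk, y_chunk, self.rng)
        clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=200)
        clf.fit(Xb, yb)
        self.members.append(clf)
        self.members = self.members[-self.n_members:]   # forget oldest

    def predict(self, X):
        """Majority vote across ensemble members (integer labels)."""
        votes = np.stack([m.predict(X) for m in self.members])
        return np.apply_along_axis(
            lambda col: np.bincount(col).argmax(), 0, votes)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    ens = ChunkEnsemble()
    for _ in range(3):                            # three stream chunks
        X = rng.normal(size=(200, 4))
        y = (rng.random(200) < 0.1).astype(int)   # ~10% minority class
        ens.update(X, y)
    print(ens.predict(X[:5]))
```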
Continuous querying over data streams has mostly been addressed as transactional queries, with some attempts at analytical processing, and in most proposals a single query is executed for a given window of data. In this paper, we propose continuously executing multiple related OLAP queries (CMOLAP) over data chosen from a data stream. The chosen data defines a context, which is stored temporarily as a multidimensional cube on which OLAP operations can be performed. Three sets of operations are defined: the first converts data in a stream to a context, the second alters the context, and the third is analytical, operating on the context and producing an output stream. More than one related analytic operation can be performed on the data in a context. The sequence of operations, referred to as a context query, is executed continuously over a time-based window, making enhanced, related analysis of the data possible. We have also developed a GUI through which the queries can be expressed in a user-friendly manner.
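As a toy illustration of the stream-to-context-to-analysis pipeline (the paper's CMOLAP operators and syntax are its own and are not reproduced here), the snippet below converts one window of stream tuples into a small context with pandas, alters the context, and runs two related OLAP-style aggregations on it.

```python
# Hypothetical mini-example of the stream -> context -> analysis pipeline.
import pandas as pd

window = [  # tuples drawn from a data stream for one time-based window
    {"time": "10:01", "store": "A", "product": "tea",    "sales": 3},
    {"time": "10:02", "store": "B", "product": "coffee", "sales": 5},
    {"time": "10:04", "store": "A", "product": "coffee", "sales": 2},
]

# 1. Convert the windowed data into a context (a small cube).
context = pd.DataFrame(window)

# 2. Alter the context: keep only the dimensions of interest.
context = context[["store", "product", "sales"]]

# 3. Run related analytical (OLAP-style) operations on the same context.
by_store = context.pivot_table(index="store", values="sales", aggfunc="sum")
by_product = context.pivot_table(index="product", values="sales", aggfunc="sum")
print(by_store, by_product, sep="\n")
```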
Data stream mining has attracted considerable attention over the past few years owing to the significance of its applications. Streaming data often evolves over time, and capturing these changes can be used to detect an event or phenomenon in a wide range of applications, from weather conditions and economic changes to astronomical and scientific phenomena. Because of the high volume and speed of data streams, it is computationally hard to capture such changes from raw data in real time. In this paper, we propose a novel algorithm, STREAM-DETECT, that captures changes in a data stream's distribution and/or domain using deviation in clustering results. STREAM-DETECT is followed by an offline classification process, CHANGE-CLASS, which associates the history of change characteristics with the observed event or phenomenon. Experimental results show the efficiency of the proposed framework in both change detection and classification accuracy.
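The paper's deviation measure is its own; as a rough illustration of the underlying idea, the sketch below clusters each incoming window with k-means and flags a change when the new centroids drift too far from the previous window's. The threshold and the synthetic data with an injected shift are assumptions.

```python
# Illustrative sketch of detecting distribution change via clustering
# deviation; STREAM-DETECT's actual deviation measure differs.
import numpy as np
from sklearn.cluster import KMeans

def centroid_shift(prev, curr):
    """Deviation: each new centroid's distance to its nearest old one."""
    d = np.linalg.norm(prev[:, None, :] - curr[None, :, :], axis=2)
    return d.min(axis=0).mean()

rng = np.random.default_rng(0)
threshold, prev_centroids = 1.0, None
for t in range(6):
    shift = 3.0 if t >= 3 else 0.0            # inject a change at window 3
    window = rng.normal(loc=shift, size=(300, 2))
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(window)
    if prev_centroids is not None:
        dev = centroid_shift(prev_centroids, km.cluster_centers_)
        if dev > threshold:
            print(f"window {t}: change detected (deviation={dev:.2f})")
    prev_centroids = km.cluster_centers_
```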
As distributed IoT applications become larger and more complex, processing raw sensor and actuation data streams directly becomes impractical. Instead, data streams must be fused into tangible facts, and these pieces of information must be combined with background knowledge to infer new knowledge. Since many IoT applications require near real-time reactivity to environmental stimuli, this inference process has to be performed in a continuous, online manner. This paper proposes a new semantic model for data stream processing and real-time reasoning based on the concepts of Semantic Stream and Fact Stream, as a natural extension of Complex Event Processing (CEP) and RDF (a graph-based knowledge model). The main advantages of our approach are that (a) it treats time as a key relation between pieces of information; (b) the processing of streams can be implemented using CEP; and (c) it is general enough to be applied to any Data Stream Management System (DSMS). We describe a patient-flow monitoring scenario in a hospital as an example of a prospective application. Finally, we present challenges and prospects for using machine learning and induction algorithms to learn abstractions and reasoning rules from a continuous data stream.
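As a toy illustration of a fact stream, far simpler than the paper's formal model, the snippet below represents timestamped RDF-style triples and applies a CEP-like rule over them to infer new facts on an output stream, echoing the hospital patient-flow scenario. The rule and data are invented for illustration.

```python
# Toy fact stream: timestamped RDF-style triples plus a CEP-like rule
# that infers new facts; the paper's semantic model is richer than this.
from collections import namedtuple

Fact = namedtuple("Fact", "subject predicate obj time")

stream = [
    Fact("patient1", "locatedIn", "triage",    0),
    Fact("patient1", "locatedIn", "radiology", 40),
    Fact("patient2", "locatedIn", "triage",    45),
]

def infer_long_stays(facts, now, max_wait=30):
    """Rule: a patient still in triage after max_wait minutes yields
    a new inferred fact on the output stream."""
    last_seen = {}
    for f in facts:                       # keep each subject's latest location
        if f.predicate == "locatedIn":
            last_seen[f.subject] = f
    return [Fact(f.subject, "status", "delayed", now)
            for f in last_seen.values()
            if f.obj == "triage" and now - f.time > max_wait]

print(infer_long_stays(stream, now=80))   # -> patient2 flagged as delayed
```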
In the era of big data, companies are increasingly driven to amass vast amounts of data, particularly in process industries where advanced sensor technologies are prevalent. However, obtaining accurate labels or product information through quality inspections can be prohibitively expensive. Active learning emerges as a promising approach to optimize data sampling by prioritizing the most informative data points. Nevertheless, active learning strategies heavily rely on predictive models that are iteratively updated. Aligning with the principles of data-centric AI, this study highlights the detrimental effects of passively incorporating all available process variables into a predictive model for guiding data collection. Specifically, in real-time sampling strategies based on online active learning, the inclusion of irrelevant features significantly hampers the efficiency of the learning process.
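A minimal sketch of the kind of online active-learning loop at issue is shown below: an incrementally updated linear model queries a label only when its prediction is uncertain. The study's point is that irrelevant input features distort exactly this uncertainty signal; the feature set, threshold, budget, and oracle here are assumed toy choices, not the paper's setup.

```python
# Minimal sketch of stream-based active learning with uncertainty sampling.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="log_loss")
classes, budget, queried = np.array([0, 1]), 30, 0

# Warm-start on a few labelled points (only the first feature is relevant).
X0 = rng.normal(size=(10, 5))
y0 = (X0[:, 0] > 0).astype(int)
clf.partial_fit(X0, y0, classes=classes)

for _ in range(500):                       # unlabelled stream of samples
    x = rng.normal(size=(1, 5))
    margin = abs(clf.decision_function(x)[0])
    if margin < 0.2 and queried < budget:  # uncertain -> query its label
        y = int(x[0, 0] > 0)               # oracle (e.g. quality inspection)
        clf.partial_fit(x, [y])
        queried += 1
print(f"labels requested: {queried} of budget {budget}")
```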
With the advent of the Internet of Things, sensors have become almost ubiquitous. From monitoring air quality to optimising manufacturing processes, networks of sensors collect vast quantities of data every day, creating multiple data streams. While some applications may warrant the use of very high quality sensors that are regularly calibrated, this may not always be the case. In most applications, sensors may have been calibrated prior to installation but their performance in the field may be compromised, or a combination of low and high quality sensors may be used. Thus, the uncertainty of the data from a sensor operating in the field may be unknown, or the quality of an estimate of the uncertainty from a prior calibration of the sensor may not be known. Taking into account the unknown or doubtful provenance of the data is critical in its subsequent analysis in order to derive meaningful insights from it.
Many data analysis problems arising in metrology involve fitting a model to observed or measured data. This paper addresses how to make inferences about model parameters based on data from multiple streams, each of doubtful or unknown provenance. It is assumed that the data arise in a system modelled as a linear response subject to Gaussian noise. To account for imperfect knowledge of data quality, Bayesian hierarchical models are used: hyper-parameters of a gamma distribution encode the degree of belief in the uncertainty estimates associated with each data stream, so that information on the uncertainties can be learnt or updated from the data. The methodology can be extended to data streams for which uncertainty statements are missing. The Bayesian posterior distribution for such hierarchical models cannot be expressed analytically in closed form, but approximate inferences can be made via the Laplace approximation to the posterior, determined by finding its mode, i.e., the maximum a posteriori (MAP) estimate. The Laplace approximation can in turn be used in a Metropolis-Hastings algorithm to sample from the posterior distribution. The methodology is illustrated on metrological examples, including inter-laboratory comparisons.
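A hedged sketch of the MAP step for a toy version of such a model is given below: a common mean theta, per-stream stated uncertainties sigma_j, and gamma-distributed precision multipliers w_j that encode trust in each sigma_j. The hyper-parameters and data are illustrative assumptions, and the Laplace and Metropolis-Hastings steps described above are not reproduced.

```python
# Toy MAP computation for a hierarchical model: y_ij ~ N(theta, sigma_j^2/w_j),
# w_j ~ Gamma(a0, b0).  Data and hyper-parameters are illustrative only.
import numpy as np
from scipy.optimize import minimize

streams = [np.array([10.1, 10.3, 9.9]),      # stream 1 readings
           np.array([10.9, 11.2, 11.0])]     # stream 2 readings
sigmas = [0.2, 0.2]                          # stated standard uncertainties
a0, b0 = 2.0, 2.0                            # gamma hyper-prior (mean w = 1)

def neg_log_post(params):
    theta, log_w = params[0], params[1:]
    w = np.exp(log_w)                        # log-parameterise so w_j > 0
    nlp = 0.0
    for y, s, wj, lwj in zip(streams, sigmas, w, log_w):
        nlp += 0.5 * wj * np.sum((y - theta) ** 2) / s**2  # Gaussian likelihood
        nlp -= 0.5 * len(y) * lwj                          # its normalising term
        nlp -= a0 * lwj - b0 * wj    # gamma prior (incl. log-Jacobian of exp)
    return nlp

res = minimize(neg_log_post, x0=np.array([10.0, 0.0, 0.0]))
theta_map, w_map = res.x[0], np.exp(res.x[1:])
print(f"MAP mean: {theta_map:.3f}, precision multipliers: {w_map.round(2)}")
```

Since the two streams disagree by about one unit while claiming the same uncertainty, the MAP solution down-weights one of them through its precision multiplier, which is exactly the "learning the trust in each stream" behaviour the hierarchical model is designed for.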
There has been growing interest in algorithms over data streams. This paper introduces the problem of sampling from sliding windows of recent data items and presents two random sampling algorithms for it. The first is a basic window-based sampling algorithm (BWRS) for count-based sliding windows: it extends classic reservoir sampling to handle the expiration of data elements from the window, avoiding drawbacks of the classic scheme. The second is a stratified multistage sampling algorithm (SMS) for time-based sliding windows: it uses different sampling fractions in different strata of the window and works even when the number of data items in the window varies dynamically over time. Theoretical analysis and experiments show that both algorithms are effective and efficient for continuous data stream processing.
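Both algorithms build on classic reservoir sampling (Algorithm R), shown below for reference; the window-expiration handling of BWRS and the stratification logic of SMS are the paper's contributions and are not reproduced here.

```python
# Classic reservoir sampling (Algorithm R), the scheme BWRS extends.
import random

def reservoir_sample(stream, k, seed=0):
    """Uniform random sample of k items from a stream of unknown length,
    using O(k) memory and a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)        # item kept with prob k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(10_000), k=5))
```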
The study of streaming data has recently become a hot topic in the database community, and many useful methods now exist for querying this new kind of data. The sliding window model is a particularly important one, and several algorithms based on it can find the frequent items in data streams. However, most of them consider only how often each item occurs, not the time span over which it arrives or the pattern of change in the data. In this paper, we introduce a new sliding-window strategy for finding such interesting patterns in streaming data.
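As a baseline for contrast, the sketch below counts items in a time-based sliding window while also retaining arrival timestamps, so both an item's frequency and its arrival time span are visible; the paper's pattern-finding strategy goes beyond this illustration, and all names here are hypothetical.

```python
# Baseline sketch: sliding-window counting that keeps arrival times, making
# both frequency and arrival span observable (not the paper's full strategy).
from collections import defaultdict, deque

class WindowCounter:
    def __init__(self, window):              # time-based window length
        self.window = window
        self.arrivals = defaultdict(deque)   # item -> arrival timestamps

    def add(self, item, t):
        q = self.arrivals[item]
        q.append(t)
        while q and q[0] <= t - self.window:  # expire old arrivals
            q.popleft()

    def frequent(self, t, min_count):
        """Items with enough in-window arrivals, with their time span."""
        out = {}
        for item, q in self.arrivals.items():
            live = [s for s in q if s > t - self.window]
            if len(live) >= min_count:
                out[item] = (len(live), live[-1] - live[0])
        return out

wc = WindowCounter(window=10)
for t, item in enumerate("abaabcab"):
    wc.add(item, t)
print(wc.frequent(t=7, min_count=3))   # -> {'a': (4, 6), 'b': (3, 6)}
```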