Association rule mining is one of the primary tasks in data mining; it discovers correlations among items in a transactional database. The majority of vertical and horizontal association rule mining algorithms have been developed to improve the frequent item discovery step, which places high demands on training time and memory, particularly when the input database is very large. In this paper, we overcome the problem of mining very large data by proposing a new parallel Map-Reduce (MR) association rule mining technique, called MR-ARM, that uses a hybrid data transformation format to quickly find frequent items and generate rules. The MR programming paradigm is becoming popular for large-scale data-intensive distributed applications due to its efficiency, simplicity and ease of use; the proposed algorithm therefore develops a fast parallel distributed batch set intersection method for finding frequent items. Two implementations (Weka, Hadoop) of the proposed MR association rule algorithm have been developed, and a number of experiments against small, medium and large data collections have been conducted. The comparisons are based on the time required by the algorithm for data initialisation, frequent item discovery, rule generation, etc. The results show that MR-ARM is a very useful tool for mining association rules from large datasets in a distributed environment.
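The map/reduce split behind such frequent-item discovery can be illustrated with a minimal single-process Python sketch. The function names, the in-memory shuffle and the toy transactions below are our own illustration, not MR-ARM's actual batch set intersection method: mappers emit candidate 1- and 2-itemsets from each transaction, and the reducer aggregates counts and filters by minimum support.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(transaction):
    # Emit (itemset, 1) for every 1- and 2-itemset in a transaction.
    for item in transaction:
        yield (frozenset([item]), 1)
    for pair in combinations(sorted(transaction), 2):
        yield (frozenset(pair), 1)

def reduce_phase(pairs, min_support):
    # Aggregate counts per itemset, then keep only frequent ones.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return {k: v for k, v in counts.items() if v >= min_support}

transactions = [["bread", "milk"], ["bread", "butter"], ["bread", "milk", "butter"]]
emitted = [kv for t in transactions for kv in map_phase(t)]
frequent = reduce_phase(emitted, min_support=2)
```

In a real MapReduce job the emitted pairs would be shuffled across the cluster by key rather than collected in one list.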
Associative classification (AC) is a research topic that integrates association rules with classification in data mining to build classifiers. Since the dissemination of the Classification-based Association Rule algorithm (CBA), the majority of its successors have been developed to improve either CBA's prediction accuracy or the search for frequent ruleitems in the rule discovery step. Both of these steps place high demands on processing time and memory, especially for large training data sets or a low minimum support threshold value. In this paper, we overcome the problem of mining large training data sets by proposing a new learning method that repeatedly transforms data between line and item spaces to quickly discover frequent ruleitems, generate rules, and subsequently rank and prune them. This learning method has been implemented in a parallel Map-Reduce (MR) algorithm called MRMCAR, which can be considered the first parallel AC algorithm in the literature. The method can be utilised in the different steps of any AC or association rule mining algorithm and scales well compared with current horizontal and vertical methods. Two versions of the learning method (Weka, Hadoop) have been implemented, and a number of experiments against different data sets have been conducted. The comparisons are based on classification accuracy and the time required by the algorithm for data initialization, frequent ruleitem discovery, rule generation and rule pruning. The results reveal that MRMCAR is superior to both current AC mining algorithms and rule-based classification algorithms in improving classification accuracy.
In the development of intelligent transportation systems, traffic data is streaming, high-dimensional and uncertain. To enable querying of uncertain traffic data streams in a distributed environment, the authors design an Uncertain Traffic Data Stream Parallel Continuous Query (UTDSPCQ) algorithm. Firstly, a sliding window is applied to receive and buffer data in the stream environment, adapting it to the MapReduce computing framework of the distributed Hadoop architecture. Then, dimension reduction and data rewriting lessen the impact of the data's high dimensionality and uncertainty on feature analysis of the dataset. Finally, a multi-attribute data point, RePoint, is newly defined to address the increase in data dimensionality caused by rewriting. Experiments show that this algorithm optimizes the traditional density-based clustering algorithm, making it more adaptable to parallel continuous queries over uncertain traffic data streams and able to fully account for newly generated streaming traffic data.
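The sliding-window buffering step can be sketched in a few lines of Python. This is a minimal single-machine illustration; the class name, window size and readings are hypothetical, and the actual UTDSPCQ algorithm hands such window snapshots to Hadoop MapReduce jobs rather than processing them locally.

```python
from collections import deque

class SlidingWindow:
    """Fixed-size sliding window buffer over a data stream (illustrative sketch)."""
    def __init__(self, size):
        # deque with maxlen evicts the oldest record automatically
        self.buffer = deque(maxlen=size)

    def push(self, record):
        self.buffer.append(record)

    def snapshot(self):
        # Contents handed off to a batch (e.g. MapReduce) computation
        return list(self.buffer)

window = SlidingWindow(size=3)
for reading in [10, 20, 30, 40]:
    window.push(reading)
# After four pushes, only the three most recent readings remain buffered.
```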
In the Big Data domain, platform dependency can alter the behaviour of a business because of the different kinds (structured, semi-structured and unstructured) and characteristics of data. With traditional infrastructure, different kinds of data cannot be processed simultaneously, as each task is tied to a particular platform, so the responsibility of selecting suitable tools lies with the user. The variety of data generated by different sources requires the selection of suitable tools without human intervention. Further, these tools also face resource limitations when dealing with large volumes of data, which affects their performance in terms of execution time. Therefore, in this work we propose a model in which different data analytics tools share a common infrastructure to provide data independence and a resource-sharing environment: the proposed model shares a common (hybrid) Hadoop Distributed File System (HDFS) between three Name-Nodes (Master Nodes), three Data-Nodes and one Client-Node, which works within a DeMilitarized Zone (DMZ). To realize this model, we have implemented Mahout, R-Hadoop and Splunk sharing a common HDFS. Using our model, we run K-means clustering, Naïve Bayes and recommender algorithms on three different datasets (movie ratings, newsgroups and spam SMS), representing structured, semi-structured and unstructured data, respectively. Our model selected the appropriate tool, e.g. Mahout for the newsgroup dataset, on which the other tools cannot run; this shows that our model provides data independence. Results of the proposed model are further compared with the legacy (individual) model in terms of execution time and scalability. The improved performance of the proposed model supports the hypothesis that it overcomes the resource limitations of the legacy model.
FP-Growth is an association rule mining algorithm that does not generate candidate sets, giving it high practical value in the face of rapidly growing data volumes in smart healthcare. However, because FP-Growth is a memory-resident algorithm, it becomes impractical on massive data sets. This paper combines Hadoop with the FP-Growth algorithm and, through analysis of real traditional Chinese medicine (TCM) data, compares its performance in stand-alone and distributed environments. The experimental results show that, once parallelised under the MapReduce model, FP-Growth has a great advantage in processing massive data, giving it better development prospects for intelligent medical treatment.
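The partition-and-merge pattern that MapReduce brings to support counting can be sketched as follows. This is an illustrative single-process Python sketch with made-up prescription data, not the paper's parallel FP-Growth implementation, which builds FP-trees rather than enumerating pairs: each partition is counted locally ("map"), and the partial counts are merged ("reduce").

```python
from collections import Counter
from itertools import combinations

def local_count(partition):
    # "Map" step: count herb pairs within one partition of prescriptions
    counts = Counter()
    for prescription in partition:
        counts.update(combinations(sorted(prescription), 2))
    return counts

def merge_counts(partial_counts):
    # "Reduce" step: merge the partial counts from all partitions
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

# Two partitions of toy TCM prescriptions, as a cluster might split them
partitions = [
    [["ginseng", "licorice"], ["ginseng", "astragalus"]],
    [["ginseng", "licorice", "astragalus"]],
]
total = merge_counts(local_count(p) for p in partitions)
```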
With the development of digital image processing technology, the application scope of image recognition has grown ever wider, touching all aspects of life. In particular, rapid urbanization and the popularization of automobiles in recent years have led to a sharp increase in traffic problems in many countries, making intelligent transportation technology based on image processing and optimized control an important research field of intelligent systems. Based on an analysis of the application requirements of intelligent transportation systems, this paper designs a high-definition bayonet (checkpoint) system for intelligent transportation. It combines data mining with distributed parallel Hadoop technology to design the architecture of an intelligent traffic operation state data analysis system, together with mining algorithms suited to it, and demonstrates the system's feasibility with experiments on real traffic big data, aiming to provide decision support on traffic state. Using the deployed Hadoop server cluster and an AdaBoost algorithm adapted to the MapReduce programming model, the system processes large volumes of traffic data, performs traffic flow and overspeed analysis, and extracts information useful for traffic control. This proves the feasibility and effectiveness of using the Hadoop platform to mine massive traffic information.
Given that existing intrusion detection systems (IDS) based on clustering algorithms cannot adapt to the large-scale growth of system logs, a K-medoids clustering intrusion detection algorithm based on differential evolution, suitable for cloud computing environments, is proposed. First, the differential evolution algorithm is combined with K-medoids clustering so that its powerful global search capability improves the convergence efficiency of clustering large-scale data samples. Second, to further improve the optimization ability of clustering, a dynamic Gemini population scheme is adopted to improve the differential evolution algorithm, maintaining population diversity while mitigating the tendency to become trapped in local optima. Finally, for intrusion detection over big data, the optimized clustering algorithm is parallelised under the Hadoop MapReduce framework. Simulation experiments were performed in a cluster environment built on the open-source cloud computing framework Hadoop. Experimental results show that the overall detection effect of the proposed algorithm is significantly better than that of existing intrusion detection algorithms.
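The underlying K-medoids assign/update loop can be sketched as below. This is a 1-D toy illustration with hypothetical data; the differential evolution search and the MapReduce parallelisation described above are omitted, and in practice the distance would be computed over multi-dimensional log feature vectors.

```python
def assign(points, medoids):
    # Assign each point to its nearest medoid (1-D distance for simplicity)
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: abs(p - m))
        clusters[nearest].append(p)
    return clusters

def update(clusters):
    # New medoid = the cluster member minimising total distance to its cluster
    new_medoids = []
    for members in clusters.values():
        best = min(members, key=lambda c: sum(abs(c - p) for p in members))
        new_medoids.append(best)
    return new_medoids

points = [1, 2, 3, 10, 11, 12]
medoids = [1, 10]
clusters = assign(points, medoids)   # {1: [1, 2, 3], 10: [10, 11, 12]}
medoids = update(clusters)           # medoids move to the cluster centres
```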
With the rapid development of the sharing economy, bike-sharing has become essential because of its zero emissions, high flexibility and accessibility. The emergence of public bicycle systems not only alleviates traffic pressure to a certain extent, but also helps solve the “last kilometer” problem of public transportation. However, due to the concentrated use of shared bikes, many are left in disorder, which seriously affects the urban environment and causes traffic problems. How to manage the allocation of shared bikes and improve a city’s shared cycling system has become a highly discussed issue. Taking Beijing as an example, we study the allocation of shared bikes using open-source data provided by Amap, Baidu Map and bike-sharing websites, and establish an optimized comprehensive evaluation model to assess the required level of provision. Finally, we look ahead to the future of the bike-sharing market.
As one of the most popular frameworks for large-scale analytics processing, Hadoop faces two challenges: both applications and storage devices are becoming heterogeneous. However, existing data placement and job scheduling schemes pay little attention to the heterogeneity of either application I/O requirements or I/O device capability, and can thus greatly degrade system efficiency. In this paper, we propose ASPS, an Application and Storage-aware data Placement and job Scheduling approach for Hadoop clusters. The idea is to place application data and schedule application tasks considering both application I/O requirements and storage device characteristics. Specifically, ASPS first introduces novel metrics to quantify the I/O requirements of applications. Then, based on this quantification, ASPS places the data of different applications on their preferred storage devices. Finally, ASPS tries to launch jobs with high I/O requirements on nodes with the same type of faster devices to improve system efficiency. We have implemented ASPS in the Hadoop framework. Experimental results show that ASPS can reduce the completion time of a single application by up to 36% and the average completion time of six concurrent applications by 27%, compared to existing data placement policies and job scheduling approaches.
In the Internet of Things (IoT) era, information collected by sensor devices can suffer data loss, uncertainty and other problems. Uncertain data must therefore be represented using probabilities in order to extract useful information for production and application from a huge indeterminate data warehouse. Because data in such databases has a particular order in time or space, High-Utility Probability Sequential Pattern Mining (HUPSPM) has become a new topic of investigation and analysis in data processing. With the introduction of timestamps, many efficient algorithms for sequential mining have been developed. However, these algorithms share a limitation: they can only be executed in a stand-alone environment and are only suitable for small datasets. We therefore introduce an advanced graph framework for processing large datasets that addresses the shortcomings of the existing methods. The proposed algorithm avoids repeated database scans and database splitting, and improves parallel computing capability. The initial database is pruned according to an existing pruning strategy to effectively reduce the number of candidate sets. Experiments show that the algorithm presented in this paper has excellent advantages in mining high-utility probability sequences from large datasets.
Interactive software often runs not only independently but also collaboratively to perform tasks, forming larger software group networks. Analysing these interactions, i.e. their patterns and frequency, is therefore essential for measuring the stability of the entire software group network. However, current studies mostly investigate the performance of software as individuals rather than as groups, omitting their interactions; in particular, some traditional measurement algorithms that execute in non-distributed runtime environments perform poorly. In this paper, we propose a new software group stability model concentrating on the network-level behaviours of software as a group. An algorithm is proposed to extract key nodes and critical interactive items based on frequent interaction patterns; the stability of a software group is then assessed, using the SG-StaMea algorithm, from the loss of connectivity caused by removing key nodes and key edges from the network. Furthermore, our algorithms can quantify this stability. To validate the efficacy of our model, the Spark and Hadoop platforms were selected as target systems. Both experiments and experimental data showed that our algorithms significantly improve the accuracy of software stability measurement compared to classical algorithms such as the Apriori frequent pattern algorithm.
The web has evolved and industry strives to work better every day; the need for data to be accessible at any moment keeps expanding, and with this expansion, the need for meaningful query techniques on the web is a major concern. To transmit meaningful data with rich semantics, machines and programs need the ability to reach the correct information and make adequate connections. This problem has been addressed since the emergence of Web 3.0: the semantic web is developing, and an immense amount of information is being collected and prepared, which poses a giant data management test if an ideal result is to be provided whenever needed. Accordingly, in this article we present a system for managing huge information using MapReduce frameworks, which internally help an engine retrieve information by exploiting parallel processing through smaller map jobs and link discovery measures. Similarity calculations can be challenging, so this work implements five similarity detection algorithms and measures the time each takes, to determine which is the better choice. The proposed framework is built on a recent and widespread data format, JSON, and uses the HIVE query language to obtain and process the information according to the customer’s needs, along with algorithms for link discovery. Finally, the results are made available on a web page that helps a user load JSON information and make connections between dataset 1 and dataset 2. The results, examined over two different sets, show that the proposed approach interlinks significantly faster; regardless of how large the information is, the time taken does not grow radically.
The results demonstrate that the interlinking of dataset 1 and dataset 2 is most notable using LD and JW, with both algorithms requiring near-ideal time. This work has also automated the interlinking process via a web page, where customers can merge two sets of data that need to be associated and used.
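Assuming LD refers to Levenshtein distance (the abstract does not expand the abbreviation), the classic dynamic-programming formulation of that similarity measure can be sketched as:

```python
def levenshtein(a, b):
    # Classic edit distance: rows of the DP table computed one at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]
```

The single-row formulation keeps memory at O(len(b)) rather than materialising the full table.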
This article examines knowledge properties for characterization and research outcomes, and deals with developing and integrating a Hadoop cloud computing platform. The platform used in this paper combines a one-piece learning algorithm, a statistical model and a cloud-based selection model; Hadoop supports this model, which is suited to large-scale data computing. We develop a Cloud Computing integrated Autocorrelation Function (CCiACF) computation model based on Hadoop learning and implement an efficient data processing and correlation system for teaching platform research. Multiple simulations are conducted on the Hadoop platform under various operating conditions to check the exactness and characteristics of the training ability. The overall aim of this research is to develop computational system performance successfully and effectively and to improve teaching platform research models using Hadoop-based cloud computing. Various experimental studies have been performed, and the findings indicate that the proposed method is highly successful in data collection, governance and analysis.
The article describes a solution for processing large volumes of unstructured health social media data in a scalable fashion using the MapReduce framework. Our work is in the context of health informatics applications involving complex text and language processing as well as large resources such as ontologies, due to which processing a single unit of text takes appreciable time. Even at a processing time on the order of one second per unit, it takes over a week to process a million units, which is unacceptable. We present a solution in which we move the processing to a MapReduce framework and achieve significant improvement in processing performance by dividing the work across a cluster of processors. This paper describes the technical details of our work in terms of the design, modeling, and implementation of such an approach. We also present experimental results demonstrating the effectiveness of our approach.
Recently, MapReduce-based implementations of clustering algorithms have been developed to cope with the Big Data phenomenon, and they show promising results, particularly for the document clustering problem. In this paper, we extend an efficient data partitioning method based on the relational analysis (RA) approach and applied to the document clustering problem, called PDC-Transitive. The original heuristic is parallelised iteratively using the MapReduce model but is designed with a single reducer, which becomes a bottleneck when processing large data; we therefore improved the design of the PDC-Transitive method to avoid the data dependencies and reduce the computation cost. Experimental results on benchmark datasets demonstrate that the enhanced heuristic yields better-quality results and requires less computing time than the original method.
This paper describes an efficient algorithm for formal concept generation in large formal contexts. While many algorithms exist for concept generation, they are not suitable for generating concepts efficiently in larger contexts. We propose an algorithm named HaLoopUNCG, based on the MapReduce framework, that uses a lightweight runtime environment called HaLoop. HaLoop, a modified version of Hadoop MapReduce, is better suited to iterative algorithms over large datasets. Our approach uses the features of HaLoop efficiently to generate concepts iteratively. First, we describe the theoretical concepts of formal concept analysis and HaLoop. Second, we provide a detailed presentation of our work, based on Lindig’s fast concept analysis algorithm, using HaLoop and the MapReduce framework. The experimental evaluations demonstrate that the HaLoopUNCG algorithm outperforms the Hadoop version of the upper neighbour concept generation (MRUNCG) algorithm, the MapReduce implementation of Ganter’s next closure algorithm, and other distributed implementations of concept generation algorithms.
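The derivation operators at the heart of formal concept analysis can be illustrated with a small Python sketch (the toy context and function names are our own; algorithms such as Lindig's, and their HaLoop parallelisations, build on exactly these two operators):

```python
def common_attributes(objects, context):
    # A' : attributes shared by every object in the set
    attrs = None
    for o in objects:
        attrs = context[o] if attrs is None else attrs & context[o]
    return attrs if attrs is not None else set()

def common_objects(attributes, context):
    # B' : objects possessing every attribute in the set
    return {o for o, a in context.items() if attributes <= a}

# Toy formal context: object -> set of attributes
context = {
    "frog":  {"aquatic", "small"},
    "fish":  {"aquatic", "small"},
    "whale": {"aquatic", "large"},
}
intent = common_attributes({"frog", "fish"}, context)
extent = common_objects(intent, context)
# (extent, intent) forms a formal concept, since applying both operators
# to the extent yields the extent back (the closure property).
```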
Collaborative filtering-based recommendation systems have become significant in various domains due to their ability to provide personalised recommendations. In e-commerce, these systems analyse the browsing history and purchase patterns of users to recommend items. In the entertainment industry, collaborative filtering helps platforms like Netflix and Spotify recommend movies, shows and songs based on users’ past preferences and ratings. This technology also finds significance in online education, where it assists in suggesting relevant courses and learning materials based on a user’s interests and previous learning behaviour. Even though much research has been done in this domain, the problems of sparsity and scalability in collaborative filtering still exist. Data sparsity means users have expressed too few preferences on items, making it difficult to understand their preferences. Recommendation systems must keep users engaged with fast responses, and hence handling today’s rapidly growing data volumes is a challenge. Sparsity affects recommendation accuracy, while scalability influences the complexity of processing the recommendations. The motivation behind this paper is to design efficient algorithms that address the sparsity and scalability problems and in turn provide a better user experience and increased user satisfaction. This paper proposes two separate, novel approaches that deal with both problems. In the first approach, an improved autoencoder is used to address sparsity, and its outcome is then processed in a parallel and distributed manner using a MapReduce-based k-means clustering algorithm with the Elbow method. Since the k-means clustering technique uses a predetermined number of clusters, it may not improve accuracy; the Elbow method therefore identifies the optimal number of clusters for the k-means algorithm.
In the second approach, a MapReduce-based Gaussian Mixture Model (GMM) with Expectation-Maximization (EM) is proposed to handle large volumes of sparse data. Both the proposed algorithms are implemented using MovieLens 20M and Netflix movie recommendation datasets to generate movie recommendations and are compared with the other state-of-the-art approaches. For comparison, metrics like RMSE, MAE, precision, recall, and F-score are used. The outcomes demonstrate that the second proposed strategy outperformed state-of-the-art approaches.
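One iteration of the MapReduce-based k-means step from the first approach can be sketched as follows. This is a 1-D single-process illustration with hypothetical data; the real algorithm runs the map and reduce phases on a Hadoop cluster over user-rating vectors and repeats until convergence, with the Elbow method choosing the number of clusters beforehand.

```python
from collections import defaultdict

def kmeans_map(point, centroids):
    # Map: emit (nearest-centroid-index, point)
    idx = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    return idx, point

def kmeans_reduce(assignments):
    # Reduce: average the points assigned to each centroid index
    groups = defaultdict(list)
    for idx, point in assignments:
        groups[idx].append(point)
    return {i: sum(ps) / len(ps) for i, ps in groups.items()}

points = [1.0, 2.0, 9.0, 10.0]
centroids = [0.0, 8.0]
assignments = [kmeans_map(p, centroids) for p in points]
new_centroids = kmeans_reduce(assignments)
```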
In this paper, we propose novel content-based image retrieval (CBIR) algorithms using Local Octa Patterns (LOtP), Local Hexadeca Patterns (LHdP) and Direction Encoded Local Binary Pattern (DELBP). LOtP and LHdP encode the relationship between center pixel and its neighbors based on the pixels’ direction obtained by considering the horizontal, vertical and diagonal pixels for derivative calculations. In DELBP, direction of a referenced pixel is determined by considering every neighboring pixel for derivative calculations which results in 256 directions. For this resultant direction encoded image, we have obtained LBP which is considered as feature vector. The proposed method’s performance is compared to that of Local Tetra Patterns (LTrP) using benchmark image databases viz., Corel 1000 (DB1) and Brodatz textures (DB2). Performance analysis shows that LOtP improves the average precision from 59.31% to 64.36% on DB1, and from 83.24% to 85.95% on DB2, LHdP improves it to 65.82% on DB1 and to 87.49% on DB2 and DELBP improves it to 60.35% on DB1 and to 86.12% on DB2 as compared to that of LTrP. Also, DELBP reduces the feature vector length by 66.62% as compared to that of LTrP. To reduce the retrieval time, the proposed algorithms are implemented on a Hadoop cluster consisting of 116 nodes and tested using Corel 10K (DB3), Mirflickr 100,000 (DB4) and ImageNet 511,380 (DB5) databases.
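The proposed descriptors extend the basic Local Binary Pattern (LBP) operator, which for each pixel thresholds its 3x3 neighbourhood against the centre and packs the results into a byte. A minimal sketch follows; the bit ordering is one common convention and the sample patch is our own, while LOtP, LHdP and DELBP add the directional encoding described above.

```python
def lbp_code(patch):
    # Basic 3x3 LBP: set a bit for each neighbour >= the centre pixel
    center = patch[1][1]
    # Neighbours read clockwise from the top-left corner
    neighbours = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                  patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    code = 0
    for bit, n in enumerate(neighbours):
        if n >= center:
            code |= 1 << bit
    return code

patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
```

Applying the operator to every pixel of a grey-level image and histogramming the codes yields the kind of feature vector these CBIR methods compare.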
The size of high throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay as you go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, developing scalable data-intensive bioinformatics applications using this model and understanding the hardware environment that these applications require for good performance, both require further research.
In this paper, we present a de Bruijn graph oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability with competitive assembly quality compared to contemporary parallel assemblers (e.g. ABySS and Contrail) over traditional HPC cluster. Moreover, we show that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure over traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that of 512 cores of traditional HPC cluster.
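The de Bruijn graph construction at the core of such assemblers can be sketched in a few lines (a single-process illustration with toy reads; GiGA itself builds and traverses this graph distributedly with Giraph over Hadoop):

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    # Nodes are (k-1)-mers; each k-mer contributes an edge prefix -> suffix
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

reads = ["ACGT", "CGTA"]
graph = build_de_bruijn(reads, k=3)
# Overlapping reads produce repeated edges (here CG -> GT appears twice),
# which assemblers use as coverage information.
```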
With the extensive use of smart devices and the blooming popularity of social media websites such as Flickr, YouTube, Twitter, and Facebook, we have witnessed an explosion of multimedia data, whose volume is formidable without effective big data technologies. It is well acknowledged that multimedia high-level semantic concept mining and retrieval has become an important research topic, while the semantic gap (i.e., the gap between the low-level features and high-level concepts) makes it even more challenging. Addressing these challenges requires joint research efforts from both the big data mining and multimedia areas. In particular, the correlations among the classes can provide important context cues to help bridge the semantic gap; however, correlation discovery is computationally expensive due to the huge amount of data. In this paper, a novel multimedia big data mining system based on the MapReduce framework is proposed to discover negative correlations for semantic concept mining and retrieval. Furthermore, the proposed system includes a big data processing platform with Mesos for efficient resource management and with Cassandra for handling data across multiple data centers. Experimental results on the TRECVID benchmark datasets demonstrate the feasibility and the effectiveness of the proposed multimedia big data mining system with negative correlation discovery for semantic concept mining and retrieval.