
  Bestsellers

  • Article (No Access)

    Transcriptomics: Quantifying Non-Uniform Read Distribution Using MapReduce

    RNA-seq is a high-throughput next-generation sequencing technique for estimating the concentration of all transcripts in a transcriptome. The method involves complex preparatory and post-processing steps which can introduce bias, and the technique produces a large amount of data [7, 19]. Two important challenges in processing RNA-seq data are therefore the ability to process vast amounts of data and methods to quantify the bias in public RNA-seq datasets. We describe a novel analysis method, based on analysing sequence motif correlations, that employs MapReduce on Apache Spark to quantify bias in next-generation sequencing (NGS) data at the deep exon level. Our implementation is designed specifically for processing large datasets and allows for scalability and deployment on cloud service providers offering MapReduce. In investigating wild-type and mutant organisms of the species D. melanogaster, we found that motifs with runs of Gs (or their complement) exhibit low motif-pair correlations in comparison with other motif-pairs. This is independent of the mean exon GC content in the wild-type data, but there is a mild dependence in the mutant data. Hence, whilst both datasets show the same trends, there is significant variation between the two samples.
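
    The abstract does not give implementation details, but the core idea — map exon sequences to per-motif counts, then correlate a motif pair's counts across exons — can be sketched in a few lines. The following pure-Python sketch is illustrative only (the motifs, data and plain Pearson correlation are our assumptions, not the authors' pipeline):

```python
# Minimal sketch (not the authors' pipeline): map each exon sequence to
# motif counts, then measure how two motifs' counts correlate across exons.
from math import sqrt

def count_motif(seq: str, motif: str) -> int:
    # count overlapping occurrences of the motif
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq[i:i + len(motif)] == motif)

def pearson(xs, ys):
    # plain Pearson correlation; assumes the counts are not constant
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

exons = ["ACGGGGTACGT", "GGGGCCCCGGGG", "ATATATACGCG", "CGGGGATCGGG"]
# "map" phase: per-exon counts for a motif pair; "reduce": correlate.
a = [count_motif(e, "GGGG") for e in exons]
b = [count_motif(e, "CCCC") for e in exons]
print(pearson(a, b))
```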

  • Article (No Access)

    MR-ARM: A MAP-REDUCE ASSOCIATION RULE MINING FRAMEWORK

    Association rule mining is one of the primary tasks in data mining; it discovers correlations among items in a transactional database. The majority of vertical and horizontal association rule mining algorithms have been developed to improve the frequent-item discovery step, which places high demands on training time and memory usage, particularly when the input database is very large. In this paper, we address the problem of mining very large data by proposing a new parallel Map-Reduce (MR) association rule mining technique called MR-ARM that uses a hybrid data transformation format to quickly find frequent items and generate rules. The MR programming paradigm is becoming popular for large-scale data-intensive distributed applications due to its efficiency, simplicity and ease of use, and the proposed algorithm therefore develops a fast parallel distributed batch set-intersection method for finding frequent items. Two implementations (Weka, Hadoop) of the proposed MR association rule algorithm have been developed, and a number of experiments against small, medium and large data collections have been conducted. The bases of the comparisons are the time required by the algorithm for data initialisation, frequent-item discovery, rule generation, etc. The results show that MR-ARM is a very useful tool for mining association rules from large datasets in a distributed environment.
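
    As an illustration of the frequent-items step that MR-ARM parallelizes, here is a minimal map/reduce sketch in plain Python (hypothetical data and a single in-process reducer; not the MR-ARM code):

```python
# Mappers emit (item, 1) per transaction; the reducer sums the counts
# and drops items below min_support.
from collections import Counter
from itertools import chain

transactions = [["bread", "milk"], ["bread", "beer"], ["milk", "bread", "eggs"]]

def map_phase(tx):                 # one mapper call per transaction
    return [(item, 1) for item in tx]

def reduce_phase(pairs, min_support):
    counts = Counter()
    for item, one in pairs:
        counts[item] += one        # shuffle + sum, as a single reducer
    return {i: c for i, c in counts.items() if c >= min_support}

pairs = chain.from_iterable(map_phase(t) for t in transactions)
print(reduce_phase(pairs, min_support=2))   # {'bread': 3, 'milk': 2}
```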

  • Article (No Access)

    Parallel Associative Classification Data Mining Frameworks Based MapReduce

    Associative classification (AC) is a research topic that integrates association rules with classification in data mining to build classifiers. Since the dissemination of the Classification-based Association Rule algorithm (CBA), the majority of its successors have been developed to improve either CBA's prediction accuracy or the search for frequent ruleitems in the rule discovery step. Both of these steps place high demands on processing time and memory, especially for large training data sets or a low minimum support threshold. In this paper, we address the problem of mining large training data sets by proposing a new learning method that repeatedly transforms data between line and item spaces to quickly discover frequent ruleitems, generate rules, and subsequently rank and prune them. This learning method has been implemented in a parallel Map-Reduce (MR) algorithm called MRMCAR, which can be considered the first parallel AC algorithm in the literature. The method can be utilised in the corresponding steps of any AC or association rule mining algorithm and scales well compared with current horizontal or vertical methods. Two versions of the learning method (Weka, Hadoop) have been implemented, and a number of experiments against different data sets have been conducted. The bases of the comparisons are classification accuracy and the time required by the algorithm for data initialization, frequent ruleitem discovery, rule generation and rule pruning. The results reveal that MRMCAR is superior to both current AC mining algorithms and rule-based classification algorithms with respect to classification accuracy.
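
    The rule-generation step of associative classification can be illustrated with a toy example: a ruleitem pairs an itemset with a class, and a rule survives if its confidence — rule support divided by itemset support — clears a threshold. This sketch uses single-item antecedents and made-up data; it is not the MRMCAR implementation:

```python
# Count itemset and ruleitem supports, then keep high-confidence rules.
from collections import Counter

rows = [({"outlook=sunny", "windy=no"}, "play"),
        ({"outlook=sunny", "windy=yes"}, "stay"),
        ({"outlook=sunny", "windy=no"}, "play")]

item_support, rule_support = Counter(), Counter()
for items, label in rows:
    for item in items:
        item_support[item] += 1
        rule_support[(item, label)] += 1

min_conf = 0.8
rules = [(item, label, rule_support[(item, label)] / item_support[item])
         for (item, label) in rule_support
         if rule_support[(item, label)] / item_support[item] >= min_conf]
for item, label, conf in sorted(rules, key=lambda r: -r[2]):
    print(f"IF {item} THEN {label}  (conf={conf:.2f})")
```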

  • Article (No Access)

    Modified Delay Scheduling: A Heuristic Approach for Hadoop Scheduling to Improve Fairness and Response Time

    Hadoop is a widely used open-source implementation of MapReduce, a popular programming model for parallel processing of large-scale data-intensive applications in a cloud environment. Sharing Hadoop clusters involves a tradeoff between fairness and data locality. When launching a local task is not possible, the Hadoop Fair Scheduler (HFS) with delay scheduling postpones node allocation for a while for the job that is to be scheduled next according to fairness, in order to achieve high locality. This waiting is wasted when the desired locality cannot be achieved within a reasonable period. In this paper, a modified delay scheduling scheme for HFS is proposed and implemented in Hadoop. It avoids the aforementioned waiting of the delay scheduler when achieving locality is not possible. Instead of blindly waiting for a local node, the proposed algorithm first estimates the time until a local node becomes available for the job and avoids waiting whenever locality cannot be achieved within the predefined delay threshold, while still accomplishing the same locality. The performance of the proposed algorithm is evaluated by extensive experiments; it works significantly better in terms of response time and fairness, achieving up to 20% speedup and up to 38% better fairness in certain cases.
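
    The decision logic described above can be condensed into a short sketch. All names and the wait-time estimate below are hypothetical; the paper's estimator is more elaborate:

```python
# Wait for a data-local node only when the estimated wait fits inside
# the delay threshold; otherwise run remotely instead of waiting in vain.
from dataclasses import dataclass, field

@dataclass
class Job:
    local_nodes: set = field(default_factory=set)

def schedule(job: Job, free_node: str, est_wait_s: float, threshold_s: float) -> str:
    if free_node in job.local_nodes:
        return "run-local"
    return "wait" if est_wait_s <= threshold_s else "run-remote"

job = Job(local_nodes={"node-3"})
print(schedule(job, "node-1", est_wait_s=2.0, threshold_s=5.0))   # wait
print(schedule(job, "node-1", est_wait_s=9.0, threshold_s=5.0))   # run-remote
```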

  • Article (Open Access)

    Graph Connectivity in Log Steps Using Label Propagation

    The fastest deterministic algorithms for connected components take logarithmic time and perform superlinear work on a Parallel Random Access Machine (PRAM). These algorithms maintain a spanning forest by merging and compressing trees, which requires pointer-chasing operations that increase memory access latency and are limited to shared-memory systems. Many of these PRAM algorithms are also very complicated to implement. Another popular method is “leader-contraction” where the challenge is to select a constant fraction of leaders that are adjacent to a constant fraction of non-leaders with high probability, but this can require adding more edges than were in the original graph. Instead we investigate label propagation because it is deterministic, easy to implement, and does not rely on pointer-chasing. Label propagation exchanges representative labels within a component using simple graph traversal, but it is inherently difficult to complete in a sublinear number of steps. We are able to overcome the problems with label propagation for graph connectivity.

    We introduce a surprisingly simple framework for deterministic, undirected graph connectivity using label propagation that is easily adaptable to many computational models. It achieves logarithmic convergence independently of the number of processors and without increasing the edge count. We employ a novel method of propagating directed edges in alternating direction while performing minimum reduction on vertex labels. We present new algorithms in PRAM, Stream, and MapReduce. Given a simple, undirected graph G = (V, E) with n = |V| vertices and m = |E| edges, our approach takes O(m) work each step, but we can only prove logarithmic convergence on a path graph. It was conjectured by Liu and Tarjan (2019) to take O(log n) steps or possibly O(log^2 n) steps. Our experiments on a range of difficult graphs also suggest logarithmic convergence. We leave the proof of convergence as an open problem.
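
    A sequential toy version of min-label propagation with alternating edge directions might look as follows (our simplified reading of the framework; it processes edges in order on one thread, so it converges even faster than a parallel round-based run would):

```python
# Propagate the minimum label along directed edge copies, flipping the
# direction on every step, until no label changes.
def connected_components(n, edges):
    label = list(range(n))            # every vertex starts with its own label
    changed, step = True, 0
    while changed:
        changed = False
        # alternate the propagation direction on each step
        pairs = edges if step % 2 == 0 else [(v, u) for u, v in edges]
        for u, v in pairs:
            if label[u] < label[v]:   # minimum reduction at the head vertex
                label[v] = label[u]
                changed = True
        step += 1
    return label

print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3, 5]
```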

  • Article (No Access)

    An Optimal Preemptive Algorithm for Online MapReduce Scheduling on Two Parallel Machines

    In this paper, we study online scheduling on two parallel machines in a MapReduce-like system, where each job contains two kinds of tasks: map tasks and reduce tasks. A job's reduce tasks can only be processed after all of its map tasks are finished. We assume that the map tasks are fractional and the reduce tasks are preemptive. Our objective is to minimize the makespan. We show that the lower bound on the competitive ratio for this MapReduce scheduling problem is 2, and we present an online algorithm with a competitive ratio of 2, which is therefore optimal.

  • Article (No Access)

    An integrated framework for anomaly detection in big data of medical wireless sensors

    Wireless sensor networks (WSNs) are ubiquitous nowadays and have applications in a variety of domains, such as machine surveillance, precision agriculture, intelligent buildings and healthcare. Detection of anomalous activities in such domains has long been a subject of intense study. As sensor networks generate enormous amounts of data every second, accurately detecting anomalous events in this data becomes a challenging task. Most of the existing techniques for anomaly detection do not scale to big data, and accuracy can be compromised when dealing with such large amounts of data. To address these issues, a unified framework for anomaly detection in big sensor data is proposed in this paper. The proposed framework is based on data compression and Hadoop MapReduce-based parallel fuzzy clustering. The clusters are further refined for better classification accuracy. The modules of the proposed framework are compared with various existing state-of-the-art algorithms. For experimental analysis, real sensor data of ICU patients was taken from the PhysioNet library. The comparative analysis reveals that the proposed framework is more time-efficient and shows better classification accuracy.
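
    The fuzzy clustering at the heart of the framework is, in its serial form, the standard fuzzy c-means membership update; the paper's contribution is distributing this work with MapReduce. A minimal NumPy sketch of the standard update (not the paper's parallel code):

```python
# Standard fuzzy c-means membership: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
import numpy as np

def memberships(X, centers, m=2.0, eps=1e-9):
    # d[i, j] = distance of point i to center j (eps avoids division by zero)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    power = 2.0 / (m - 1.0)
    ratio = (d[:, :, None] / d[:, None, :]) ** power   # d_ij / d_ik
    return 1.0 / ratio.sum(axis=2)                     # rows sum to 1

X = np.array([[0.0, 0.0], [1.0, 0.1], [5.0, 5.0]])
C = np.array([[0.0, 0.0], [5.0, 5.0]])
print(memberships(X, C))   # point 0 leans strongly toward center 0
```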

  • Article (No Access)

    An independent time optimized hybrid infrastructure for big data analytics

    In the big data domain, platform dependency can alter the behavior of the business, because of the different kinds (structured, semi-structured and unstructured) and characteristics of the data. With a traditional infrastructure, different kinds of data cannot be processed simultaneously, owing to the platform dependency of each tool for a particular task; the responsibility of selecting suitable tools therefore lies with the user. The variety of data generated by different sources requires the selection of suitable tools without human intervention. Further, these tools face resource limitations when dealing with large volumes of data, which affects their performance in terms of execution time. Therefore, in this work we propose a model in which different data analytics tools share a common infrastructure to provide data independence and a resource-sharing environment: the proposed model shares a common (hybrid) Hadoop Distributed File System (HDFS) among three Name-Nodes (master nodes), three Data-Nodes and one Client-Node, operating within a demilitarized zone (DMZ). To realize this model, we implemented Mahout, R-Hadoop and Splunk sharing a common HDFS. Using the model, we ran K-means clustering, Naïve Bayes and recommender algorithms on three different datasets (movie rating, newsgroup and spam SMS), representing structured, semi-structured and unstructured data, respectively. Our model selected the appropriate tool, e.g. Mahout for the newsgroup dataset, as the other tools cannot run on this data; this demonstrates that the model provides data independence. Results of the proposed model are compared with the legacy (individual) model in terms of execution time and scalability. The improved performance of the proposed model supports the hypothesis that it overcomes the resource limitations of the legacy model.

  • Article (No Access)

    An Accurate Sequence Assembly Algorithm for Livestock, Plants and Microorganism Based on Spark

    Sequence assembly is one of the important topics in bioinformatics research. Sequence assembly algorithms have long faced the problems of poor assembly precision and low efficiency. In view of these two problems, this paper designs and implements a precise assembly algorithm based on MapReduce and the Eulerian path algorithm, under a strategy of finding the source of reads (SA-BR-MR). Computational results show that SA-BR-MR is more accurate than other algorithms. SA-BR-MR was used to assemble 54 sequences randomly selected from animals, plants and microorganisms in NCBI, with base lengths from hundreds to tens of thousands; the matching rates of all 54 sequences are 100%. For each species, the algorithm summarizes the range of K that yields a 100% matching rate. To verify the K-value range for hepatitis C virus (HCV) and related variants, eight randomly selected HCV variants were assembled; the results confirm the K range for hepatitis C and related variants from NCBI, and provide a basis for sequencing other HCV variants. In addition, the Spark platform is a new computing platform based on in-memory computation, which is highly efficient and suitable for iterative calculation. Therefore, this paper also designs and implements a sequence assembly algorithm on the Spark platform under the same strategy of finding the source of reads (SA-BR-Spark). In comparison with SA-BR-MR, SA-BR-Spark shows superior computational speed.
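
    The Eulerian-path family of assemblers that SA-BR-MR builds on starts from a de Bruijn graph: every k-mer of a read becomes an edge between its prefix and suffix (k-1)-mers, and an assembly corresponds to an Eulerian path through that graph. A toy construction (illustrative only; the read-source strategy and the MapReduce distribution are not shown):

```python
# Build a de Bruijn graph: each k-mer contributes an edge from its
# (k-1)-mer prefix to its (k-1)-mer suffix.
from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # prefix -> suffix edge
    return graph

reads = ["ACGTAC", "CGTACG", "GTACGT"]
for node, nexts in de_bruijn(reads, k=4).items():
    print(node, "->", nexts)
```

    The choice of k drives the K-range observation in the abstract: small k merges repeats into ambiguous branches, while large k fragments the graph when coverage is low.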

  • Article (No Access)

    The Application of FP-Growth Algorithm Based on Distributed Intelligence in Wisdom Medical Treatment

    FP-Growth is an association rule mining algorithm that does not generate candidate sets, so it has very high practical value in the face of the rapid growth of data volume in smart healthcare. Because FP-Growth is a memory-resident algorithm, however, it becomes impractical for massive data sets. This paper combines Hadoop with the FP-Growth algorithm and, through analysis of real traditional Chinese medicine (TCM) data, compares its performance in two different environments: stand-alone and distributed. The experimental results show that the FP-Growth algorithm has a great advantage in processing massive data under the MapReduce parallel model, and thus better development prospects for intelligent medical treatment.
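
    For readers unfamiliar with FP-Growth, the memory-resident structure in question is the FP-tree, into which frequency-sorted transactions are inserted so that shared prefixes are compressed. A compact sketch of the serial insertion step (the paper's contribution is distributing such work over Hadoop, which is not shown here):

```python
# Insert transactions into an FP-tree; shared prefixes share nodes.
class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def insert(root, transaction):
    node = root
    for item in transaction:         # items pre-sorted by global frequency
        node = node.children.setdefault(item, Node(item))
        node.count += 1

root = Node(None)
for tx in [["f", "c", "a", "m"], ["f", "c", "a", "b"], ["f", "b"]]:
    insert(root, tx)
print(root.children["f"].count)                  # 3
print(root.children["f"].children["c"].count)    # 2
```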

  • Article (No Access)

    STiMR k-Means: An Efficient Clustering Method for Big Data

    Big data clustering has become an important challenge in data analysis, since several applications require scalable clustering methods to organize such data into groups of similar objects. Given the computational cost of most existing clustering methods, we propose in this paper a new clustering method, referred to as STiMR k-means, able to provide a good tradeoff between scalability and clustering quality. The proposed method is based on the combination of three acceleration techniques: sampling, the triangle inequality and MapReduce. Sampling is used to reduce the number of data points when building cluster prototypes, the triangle inequality is used to reduce the number of comparisons when looking for nearest clusters, and MapReduce is used to configure a parallel framework for running the proposed method. Experiments performed on simulated and real datasets show the effectiveness of the proposed method, compared with existing ones, in terms of running time, scalability and internal validity measures.
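
    The triangle-inequality acceleration can be illustrated with the classic Elkan-style bound: if the distance from a point to its current best center is at most half the distance between that center and another center, the other center cannot be closer and need not be evaluated. A NumPy sketch of this pruning in the assignment step (generic, not the exact STiMR code):

```python
# Skip center j when d(x, best) <= d(best, j) / 2: by the triangle
# inequality, d(x, j) >= d(best, j) - d(x, best) >= d(x, best).
import numpy as np

def assign(X, centers):
    cc = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        best = 0
        d_best = np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            if d_best <= cc[best, j] / 2:     # pruned: j cannot win
                continue
            d = np.linalg.norm(x - centers[j])
            if d < d_best:
                best, d_best = j, d
        labels[i] = best
    return labels

X = np.array([[0.1, 0.0], [9.9, 10.0]])
C = np.array([[0.0, 0.0], [10.0, 10.0]])
print(assign(X, C))   # [0 1]
```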

  • Article (No Access)

    A Novel Strategy for Retrieving Large Scale Scene Images Based on Emotional Feature Clustering

    Owing to their complex data structure, images can present rich information, and so they are used widely in many different fields. Although images offer much convenience, handling such data consumes considerable time and multi-dimensional space, and the disadvantage is especially apparent when users need to retrieve images from large-scale image datasets. In order to retrieve large-scale image data effectively, a scene image retrieval strategy based on the MapReduce parallel programming model is proposed. The proposed strategy first investigates how to effectively store large-scale scene images under a Hadoop cluster parallel processing architecture. Second, a distributed MeanShift feature clustering algorithm is introduced to implement the clustering of emotional features of scene images. Finally, several experiments are conducted to verify the effectiveness and efficiency of the proposed strategy in terms of retrieval accuracy, speedup ratio, efficiency and data scalability.
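
    The MeanShift clustering the strategy distributes is, in serial form, a repeated shift of each point to the mean of its neighbors within a bandwidth. A minimal flat-kernel sketch (illustrative; the emotional-feature extraction and the Hadoop distribution are not shown):

```python
# Flat-kernel mean shift: repeatedly move each mode estimate to the
# mean of the original points within the bandwidth.
import numpy as np

def mean_shift(X, bandwidth, iters=30):
    modes = X.astype(float)
    for _ in range(iters):
        for i, p in enumerate(modes):
            near = X[np.linalg.norm(X - p, axis=1) <= bandwidth]
            modes[i] = near.mean(axis=0)     # shift toward the local mean
    return modes

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
print(np.round(mean_shift(X, bandwidth=1.0), 2))   # two modes emerge
```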

  • Article (No Access)

    Intrusion Detection Based on Dynamic Gemini Population DE-K-mediods Clustering on Hadoop Platform

    In view of the fact that existing intrusion detection systems (IDS) based on clustering algorithms cannot adapt to the large-scale growth of system logs, a K-medoids clustering intrusion detection algorithm based on differential evolution, suitable for cloud computing environments, is proposed. First, the differential evolution algorithm is combined with the K-medoids clustering algorithm, using the powerful global search capability of differential evolution to improve the convergence efficiency of large-scale data sample clustering. Second, in order to further improve the optimization ability of the clustering, a dynamic Gemini population scheme is adopted to improve the differential evolution algorithm, maintaining the diversity of the population while reducing the tendency to become trapped in local optima. Finally, for intrusion detection processing of big data, the optimized clustering algorithm is parallelized under the Hadoop MapReduce framework. Simulation experiments were performed in a cluster environment on the open-source cloud computing framework Hadoop. Experimental results show that the overall detection performance of the proposed algorithm is significantly better than that of existing intrusion detection algorithms.
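
    The differential evolution operators the algorithm builds on are standard: mutate by adding a scaled difference of two population members to a third, then recombine with the current candidate. A generic DE/rand/1 step in NumPy (fitness-based selection, the Gemini population scheme, and the K-medoids encoding are omitted; this is not the paper's code):

```python
# One DE/rand/1 generation: differential mutation + binomial crossover.
import numpy as np

rng = np.random.default_rng(0)

def de_step(pop, F=0.5, CR=0.9):
    new = pop.copy()
    for i in range(len(pop)):
        a, b, c = pop[rng.choice([j for j in range(len(pop)) if j != i],
                                 size=3, replace=False)]
        mutant = a + F * (b - c)                 # differential mutation
        cross = rng.random(pop.shape[1]) < CR    # binomial crossover mask
        new[i] = np.where(cross, mutant, pop[i]) # selection step omitted
    return new

pop = rng.random((6, 4))       # 6 candidate solutions, 4 dimensions each
print(de_step(pop).shape)      # (6, 4)
```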

  • Article (No Access)

    A Multi-Input File Data Symmetry Placement Method Considering Job Execution Frequency for MapReduce Join Operation

    In recent years, data-parallel computing frameworks such as Hadoop have become increasingly popular among scientists, and data-grouping-aware placement of multiple input files for Hadoop has attracted growing attention. However, many data-grouping-aware data placement schemes for multiple input files do not take MapReduce job execution frequency into account; our study shows that such schemes increase data transmission between nodes. The starting point of this paper is that if a certain type of MapReduce job has been executed frequently in the recent past, it can be assumed that this type of job will also have a higher chance of being executed later. Based on this assumption, we propose a data-grouping-aware symmetric placement method for multiple input files based on MapReduce job execution frequency (DGAMF). Based on the history of MapReduce job executions, this method first creates an inter-block join-access correlation model, then divides the correlated blocks into groups according to this model, and gives a mathematical model for data placement. The model can be used to guide the placement of data blocks so as to solve the node load-balancing issue caused by data asymmetry. Using the proposed method, correlated blocks from the same groups are placed in the same set of nodes, thereby effectively reducing the amount of data transmitted between nodes. Our data placement method was validated in an experimental Hadoop environment. Experimental results showed that the proposed method effectively processes massive datasets and significantly improves MapReduce's efficiency.
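
    The inter-block join-access correlation model can be approximated, in miniature, by counting how often block pairs are accessed together by recent jobs and grouping pairs above a frequency threshold. A toy sketch with hypothetical block names (the paper's model and placement mathematics are richer than this):

```python
# Count co-accessed block pairs across the job history, then keep the
# strongly correlated pairs as placement groups.
from collections import Counter
from itertools import combinations

job_history = [["b1", "b2"], ["b1", "b2", "b3"], ["b4", "b5"]]

co_access = Counter()
for blocks in job_history:
    for x, y in combinations(sorted(blocks), 2):
        co_access[(x, y)] += 1               # join-access correlation

threshold = 2
groups = [pair for pair, freq in co_access.items() if freq >= threshold]
print(groups)    # [('b1', 'b2')] -> place b1 and b2 on the same nodes
```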

  • Article (No Access)

    Application and Storage-Aware Data Placement and Job Scheduling for Hadoop Clusters

    As one of the most popular frameworks for large-scale analytics processing, Hadoop faces two challenges: both applications and storage devices are becoming heterogeneous. However, existing data placement and job scheduling schemes pay little attention to the heterogeneity of either application I/O requirements or I/O device capability, and thus can greatly degrade system efficiency. In this paper, we propose ASPS, an Application and Storage-aware data Placement and job Scheduling approach for Hadoop clusters. The idea is to place application data and schedule application tasks considering both application I/O requirements and storage device characteristics. Specifically, ASPS first introduces novel metrics to quantify the I/O requirements of applications. Then, based on this quantification, ASPS places the data of different applications on the preferred storage devices. Finally, ASPS tries to launch jobs with high I/O requirements on nodes with the same type of faster devices to improve system efficiency. We have implemented ASPS in the Hadoop framework. Experimental results show that ASPS can reduce the completion time of a single application by up to 36% and the average completion time of six concurrent applications by 27%, compared to existing data placement policies and job scheduling approaches.

  • Article (No Access)

    Ensemble Model for Stock Price Forecasting: MapReduce Framework for Big Data Handling: An Optimal Trained Hybrid Model for Classification

    Many studies have examined how huge volumes of data can be handled and classified. This work introduces a novel big data classification paradigm built from preprocessing, feature extraction and classification techniques. Data normalization is carried out at the preprocessing stage, and the MapReduce framework is then utilized to manage the massive data. Statistical features (mean, median, min/max and standard deviation), higher-order statistical features (skewness, kurtosis and enhanced entropy), and correlation-based features are extracted prior to classification. A hybrid Bi-LSTM and deep maxout classification model classifies the data during the reduce stage. To ensure classification accuracy, training is carried out with the new Hybrid Butterfly Positioned Coot Optimization (HBPCO) algorithm. The proposed method's accuracy of 97.45% beats that of NN (85.13%), CNN (83.78%), RNN (78.37%), Bi-LSTM (82.43%) and SVM (87.83%).
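
    Most of the listed features are standard statistics; a sketch of the extraction step using NumPy and SciPy follows (the "enhanced entropy" feature and the HBPCO-trained hybrid classifier are specific to the paper and not shown):

```python
# Extract the standard and higher-order statistical features named in
# the abstract from a single numeric series.
import numpy as np
from scipy import stats

def extract_features(x: np.ndarray) -> dict:
    return {
        "mean": float(np.mean(x)),
        "median": float(np.median(x)),
        "min": float(np.min(x)),
        "max": float(np.max(x)),
        "sd": float(np.std(x, ddof=1)),          # sample standard deviation
        "skewness": float(stats.skew(x)),
        "kurtosis": float(stats.kurtosis(x)),
    }

print(extract_features(np.array([1.0, 2.0, 2.0, 3.0, 9.0])))
```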

  • Article (No Access)

    Coding Productivity in MapReduce Applications for Distributed and Shared Memory Architectures

    MapReduce was originally proposed as a suitable and efficient approach for analyzing and processing large amounts of data. Since then, many research efforts have contributed MapReduce implementations for distributed and shared-memory architectures. Nevertheless, different architectural levels require different optimization strategies in order to achieve high-performance computing, and these strategies have in turn led to very different MapReduce programming interfaces across these implementations. This paper presents some research notes on coding productivity when developing MapReduce applications for distributed and shared-memory architectures. As a case study, we introduce our current research on a unified MapReduce domain-specific language with code generation for Hadoop and Phoenix++, which has achieved coding productivity increases of between 41.84% and 94.71% without significant performance losses (below 3%) compared to those frameworks.

  • Article (No Access)

    Evaluating the Effects of Modern Storage Devices on the Efficiency of Parallel Machine Learning Algorithms

    Big data analytics is presently one of the most rapidly emerging areas of research for both organizations and enterprises. The requirement to deploy efficient machine learning algorithms over huge amounts of data has led to the development of parallelization frameworks and specialized libraries (like Mahout and MLlib) which implement the most important of these algorithms. Moreover, recent advances in storage technology have resulted in the introduction of high-performing devices, broadly known as Solid State Drives (SSDs). Compared to traditional Hard Disk Drives (HDDs), SSDs offer considerably higher performance and lower power consumption. Motivated by these appealing features and the growing necessity for efficient large-scale data processing, we compared the performance of several machine learning algorithms on MapReduce clusters whose nodes are equipped with HDDs, SSDs, and devices implementing the latest 3D XPoint technology. In particular, we evaluate several dataset preprocessing methods, like vectorization and dimensionality reduction, two supervised classifiers, Naive Bayes and Linear Regression, and the popular k-Means clustering algorithm. We use an experimental cluster equipped with the three aforementioned storage devices under different configurations, and two large datasets, Wikipedia and HIGGS. The experiments show that the benefits deriving from the usage of SSDs depend on the cluster setup and the nature of the applied algorithms.

  • Article (No Access)

    Kernel Optimized-Support Vector Machine and MapReduce Framework for Sentiment Classification of Train Reviews

    Sentiment analysis is one of the popular techniques gaining attention in recent times. Nowadays, people obtain information from user reviews regarding public transportation, movies, hotel reservations, etc., by utilizing the available resources to meet their needs. Sentiment classification is therefore an essential process for determining positive and negative responses. This paper presents an approach for sentiment classification of train reviews using a MapReduce model with the proposed Kernel Optimized-Support Vector Machine (KO-SVM) classifier. The MapReduce framework handles big data using a mapper, which performs the feature extraction, and a reducer, which classifies the reviews using the KO-SVM. The feature extraction process utilizes classification-specific and SentiWordNet-based features. KO-SVM adopts SVM for the classification, replacing the exponential kernel with an optimized kernel whose weights are found using a novel optimizer, the Self-adaptive Lion Algorithm (SLA). In a comparative analysis, the performance of the KO-SVM classifier is compared with SentiWordNet, NB, NN and LSVM, using the evaluation metrics specificity, sensitivity and accuracy, on a train review and a movie review database. The proposed KO-SVM classifier attains maximum sensitivity of 93.46% and 91.249%, specificity of 74.485% and 70.018%, and accuracy of 84.341% and 79.611%, respectively, for the train review and movie review databases.
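
    The notion of swapping in a custom, weighted kernel can be sketched with scikit-learn's callable-kernel hook. The weighted RBF kernel below is an illustrative stand-in: the paper's kernel and its SLA-based weight optimization are not reproduced here:

```python
# Train an SVM with a custom kernel: the callable receives two data
# matrices and returns their Gram matrix.
import numpy as np
from sklearn.svm import SVC

def weighted_rbf(gamma=0.5, w=1.0):
    def kernel(A, B):
        # Gram matrix of a weighted RBF kernel; w is the tunable weight
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return w * np.exp(-gamma * d2)
    return kernel

X = np.array([[0.0], [1.0], [4.0], [5.0]])
y = np.array([0, 0, 1, 1])
clf = SVC(kernel=weighted_rbf(gamma=0.5, w=1.0)).fit(X, y)
print(clf.predict(np.array([[0.5], [4.5]])))   # [0 1]
```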

  • Article (No Access)

    Feature Selection Using Games with Imperfect Information (FSGIN)

    Game Theory (GT) is the study of strategic decision making. By virtue of its importance, several GT-based methodologies for Feature Selection (FS) have been proposed in recent times. The FS problem can be abstracted as a game by considering each feature as a player and its values as its strategies, with the overall goal of the game being to classify a data instance appropriately. Most of the existing GT-based FS techniques are restricted to zero-sum games, non-zero-sum games and cooperative games. The classical assumption that all the details of all players are known to every player does not hold in many real-world problems: when the given features are independent, they cannot be treated alike, and a characteristics-based uncertainty persists among the features. This uncertainty is handled by none of the game forms used in the existing methods. Unlike those game forms, Bayesian Games (BG) address games with imperfect information. This paper investigates the FS problem in terms of BG and proposes a novel method to select the best features. The proposed BG-based FS method is a filter-type FS method; it starts by identifying Principal Features (PF) and proceeds to play global pairwise Bayesian games between those PF to obtain feature scores, after which the features are ranked using these scores. In the final stage, a forward selection method with a Support Vector Machine (SVM) is used to evaluate the classification performance of the ranked features and to select an optimal set of features. Furthermore, the MapReduce paradigm is exploited to improve the scalability of the proposed method. To show the efficacy of the proposed method, experiments are carried out with seven real-world datasets from the UCI and Statlog repositories. The results show a significant improvement in classification performance with fewer selected features than existing methods achieve.
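
    The final forward-selection stage is a generic wrapper procedure and can be sketched as follows (synthetic data and plain cross-validation; the Bayesian-game scoring that produces the ranking is not shown):

```python
# Greedy forward selection over pre-ranked features: add a feature only
# if it improves cross-validated SVM accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_select(X, y, ranked_features, cv=3):
    chosen, best = [], 0.0
    for f in ranked_features:                  # features in rank order
        trial = chosen + [f]
        score = cross_val_score(SVC(), X[:, trial], y, cv=cv).mean()
        if score > best:                       # keep f only if it helps
            chosen, best = trial, score
    return chosen, best

rng = np.random.default_rng(1)
X = rng.random((60, 4))
y = (X[:, 2] > 0.5).astype(int)                # feature 2 is informative
print(forward_select(X, y, ranked_features=[2, 0, 1, 3]))
```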