With the advent of the big data era, data-driven decision-making and analysis are increasingly valued in many fields. In education in particular, using big data technology to better understand student needs, optimize the educational process, and improve education quality has become an important research topic. This paper explores the application of decision trees (DT) and correlation analysis algorithms to the analysis of college students' physical fitness, in order to provide a scientific basis for improving their physical health. A big data and data mining (DM) approach is proposed to extract the rules contained in the data, so as to directly support decision-making in physical fitness testing and analysis. The results show that training on the training set achieves good classification accuracy, and that optimizing the tree depth raises the accuracy to more than 85.033%. Using DM technology as a carrier, this paper uncovers the rules behind college students' physical fitness data, extracting previously unknown, implicit, and potentially useful information and knowledge.
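As a rough illustration of the depth-tuning step described above, the following sketch trains a decision tree at several depths and keeps the best one; the dataset here is a synthetic placeholder, not the paper's fitness-test data:

```python
# Hypothetical sketch: depth-tuned decision tree for fitness-test records.
# The dataset and split below are placeholders, not the paper's data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stand-in for physical-fitness records (e.g., vital capacity, sprint time).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Search over tree depth, as the abstract describes optimizing the depth.
best_depth, best_acc = None, 0.0
for depth in range(2, 11):
    clf = DecisionTreeClassifier(max_depth=depth, criterion="entropy",
                                 random_state=0)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

print(f"best max_depth={best_depth}, accuracy={best_acc:.3f}")
```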
In the research area of community-based question answering (cQA) services such as Yahoo! Answers, reuse of answers has attracted growing interest. Most researchers focus on the correctness of answers and pay little attention to their completeness. In this paper, we address the answer completeness problem for "survey questions", for which completeness is crucial. We propose to generate a more complete answer from replies in cQA services through question-oriented extractive summarization based on a term hierarchy, which differs from traditional query-based extractive summarization. The experimental results are promising in terms of recall, precision, and conciseness.
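The paper's term-hierarchy method is not reproduced here, but a minimal stand-in for question-oriented extractive summarization can be sketched with plain TF-IDF relevance scoring; the question, replies, and 0.1 threshold below are invented for illustration:

```python
# Generic question-oriented extractive summarization via TF-IDF similarity.
# A simpler stand-in, NOT the paper's term-hierarchy method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = "What should I pack for a week of winter camping?"
replies = [
    "Bring a four-season tent and a sleeping bag rated below freezing.",
    "I went camping last year and it rained the whole time.",
    "Pack layered clothing, a stove, and extra fuel for melting snow.",
]

vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform([question] + replies)
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()

# Keep the replies most relevant to the question, best-scoring first.
summary = [replies[i] for i in scores.argsort()[::-1] if scores[i] > 0.1]
print(summary)
```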
Internet of Things (IoT) devices built on different processor architectures have increasingly become targets of adversarial attacks. In this paper, we propose an algorithm for malware classification in the IoT domain to deal with increasingly severe IoT security threats. Application executions are represented as sequences of consecutive API calls. The time-series data are analyzed and filtered using an improved information gain measure, which, according to the experimental results, reduces the sequence lengths of the input data while retaining the important information more effectively than chi-square statistics. We use a multi-layer convolutional neural network, well suited to processing time-series data, to classify the malware. As the convolution window slides along the time sequence, it aggregates local sequence features into higher-level representations, capturing the characteristics of each sequence position. By comparing the iterative efficiency of different optimization algorithms in the model, we select one that approaches the optimal solution in a small number of iterations, speeding up the convergence of model training. Experimental results on real-world IoT malware samples show that the classification accuracy of this approach can exceed 98%. Overall, a comprehensive evaluation demonstrates that our method is practical for IoT malware classification, with high accuracy and low computational overhead.
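The "improved" information gain used for filtering is specific to the paper, but the standard information gain it builds on can be sketched as follows; the binary API-feature matrix and labels are random placeholders, and the retained columns would then feed the CNN:

```python
# Illustrative information-gain scoring of API-call features (e.g., presence
# of an API n-gram per execution trace). This is the standard entropy-based
# information gain, not the paper's improved variant.
import numpy as np

def entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature: np.ndarray, labels: np.ndarray) -> float:
    """IG(class; feature) = H(class) - H(class | feature) for one column."""
    total = entropy(labels)
    cond = 0.0
    for v in np.unique(feature):
        mask = feature == v
        cond += mask.mean() * entropy(labels[mask])
    return total - cond

# X: binary matrix (traces x API features); y: malware family labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))
y = rng.integers(0, 4, size=200)

gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
keep = gains.argsort()[::-1][:16]   # retain top-scoring features, shorten input
print(keep)
```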
Clustering techniques split data into clusters such that the elements of each cluster are more similar to one another than to elements of other clusters. Some of these techniques can handle the uncertainty of the clustering process, while others may have stability issues. In this paper, a novel method called Minimum Information Gain Roughness (MIGR) is proposed to select the clustering attribute based on information entropy with rough set theory. To evaluate its performance, three benchmark UCI datasets are clustered using MIGR, and the resulting clusters are compared with those produced by the Min-Min-Rough (MMR) and information-theoretic dependency roughness (ITDR) algorithms. Both of the latter techniques have previously been compared with a variety of clustering algorithms such as k-modes, fuzzy centroids, and fuzzy k-modes. Global purity, overall purity, and F-measure are used as performance measures to compare the quality of the resulting clusters. The experimental results show that MIGR outperforms both MMR and ITDR for clustering categorical data.
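MIGR's exact criterion is defined in the paper; the sketch below shows only the per-attribute information entropy that such entropy/rough-set attribute-selection methods build on, over a toy categorical table:

```python
# Building block behind entropy-based clustering-attribute selection.
# MIGR itself combines scores like these with rough-set roughness; this
# sketch shows only the per-attribute Shannon entropy.
from collections import Counter
from math import log2

def attribute_entropy(rows, attr):
    """Shannon entropy of one categorical attribute over the data table."""
    counts = Counter(row[attr] for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

table = [
    {"color": "red",  "shape": "round",  "size": "big"},
    {"color": "red",  "shape": "square", "size": "small"},
    {"color": "blue", "shape": "round",  "size": "big"},
    {"color": "blue", "shape": "round",  "size": "small"},
]

# Rank candidate clustering attributes by their entropy scores.
for attr in ("color", "shape", "size"):
    print(attr, round(attribute_entropy(table, attr), 3))
```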
Classification is one of the major tasks in data mining, aiming to build classifiers for decision making. One of the most recent online threats is phishing, which has caused significant losses to online shoppers, electronic businesses, and financial institutions. A common form of phishing is impersonating online websites to deceive users and steal their financial information. One way to guide an anti-phishing classification method is to identify a minimal set of relevant features in advance, so that the search space can be reduced. The aim of this paper is to compare different feature assessment techniques in the website phishing context in order to determine the minimal set of features for detecting phishing activities. Experiments were conducted on real phishing datasets consisting of 30 features, using three known feature selection methods. New feature cutoffs were identified after statistical analysis utilising three data mining classification methods. We were able to identify new clusters of features that, when used together, can detect phishing activities. Further, important correlations among common features were derived.
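A minimal sketch of such a comparison, assuming a synthetic 30-feature dataset in place of the real phishing data, scoring features by chi-square and by mutual information (the usual estimator of information gain):

```python
# Sketch of comparing feature-assessment techniques on a 30-feature dataset;
# the data here are synthetic placeholders for the real phishing features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           random_state=0)
X = (X > 0).astype(int)   # phishing features are typically categorical flags

chi2_scores, _ = chi2(X, y)
mi_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

# A cutoff keeps only the top-ranked features before training classifiers.
top_chi2 = np.argsort(chi2_scores)[::-1][:10]
top_mi = np.argsort(mi_scores)[::-1][:10]
print("chi2 picks:  ", sorted(top_chi2))
print("IG/MI picks: ", sorted(top_mi))
print("overlap:     ", sorted(set(top_chi2) & set(top_mi)))
```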
Background: Tumor purity is of great significance for the study of tumor genotyping and the prediction of recurrence, and it is significantly affected by tumor heterogeneity. Tumor heterogeneity underlies drug resistance in various cancer treatments, and DNA methylation plays a core role in its generation. Almost all types of cancer cells are associated with abnormal DNA methylation in certain regions of the genome. The selection of tumor-related differentially methylated sites, which can serve as an indicator of tumor purity, has important implications for purity assessment. At present, the selection of informative sites mostly focuses on inter-tumor heterogeneity and ignores the spatial heterogeneity of tumor growth, that is, sample specificity.
Results: Considering the specificity of tumor samples and the information gain of individual tumor samples relative to normal samples, we present an approach, PESM, that evaluates tumor purity through the sample-specific differentially methylated sites of tumor samples. Applied to more than 200 tumor samples of prostate adenocarcinoma (PRAD) and kidney renal clear cell carcinoma (KIRC), PESM yields purity estimates highly consistent with those of other existing methods. In addition, PESM performs better than a method that uses the integrated signal of methylation sites to estimate purity. Therefore, different informative-site selection methods have an important impact on the estimation of tumor purity, and the selection of sample-specific informative sites contributes to accurate identification of the tumor purity of samples.
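PESM's actual statistic is defined in the paper; the following is a loudly hypothetical sketch of the general idea of scoring sites by one tumor sample's deviation from a normal reference and summarizing the top sites, with random beta values standing in for real methylation data:

```python
# HYPOTHETICAL sketch, not the PESM algorithm: pick sample-specific sites by
# deviation from the normal cohort, then summarize them as a purity proxy.
import numpy as np

rng = np.random.default_rng(0)
normals = rng.beta(2, 8, size=(40, 1000))   # beta values of normal samples
tumor = rng.beta(4, 6, size=1000)           # one tumor sample

# Sample-specific informative sites: largest deviation from the normal mean,
# measured in normal-cohort standard deviations.
z = np.abs(tumor - normals.mean(axis=0)) / (normals.std(axis=0) + 1e-9)
sites = np.argsort(z)[::-1][:100]

# Crude purity proxy: average absolute methylation shift at selected sites
# (the scaling to [0, 1] here is an arbitrary illustration).
shift = np.abs(tumor[sites] - normals.mean(axis=0)[sites]).mean()
purity_proxy = float(np.clip(shift * 2, 0, 1))
print(round(purity_proxy, 3))
```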
A quantum measurement process, when non-trivial, is not a closed evolution: the appearance of classical outcomes is usually interpreted as evidence of a decoherence-like mechanism causing quantum superpositions to degrade into classical mixtures. This mechanism is due to a net flow of information from the input system (the measurement object), through the physical apparatus interacting with the object (the measurement probe), into some environment, the latter representing all those degrees of freedom that are not directly accessible to the experimenter. For this reason, the state reduction induced by the measurement process generally entails an irreversible state change. The aim of our contribution is to answer the following questions: how much information is a measurement able to extract? How irreversible is the state reduction due to a particular measurement process? And in what way are information gain and irreversibility related?
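One standard way to quantify the first question, though not necessarily the authors' exact definition, is the Groenewold information gain, the average entropy decrease from the input state to the post-measurement states:

```latex
% Groenewold information gain for a measurement \{M_i\} on state \rho.
\[
  I(\{M_i\},\rho) \;=\; S(\rho) \;-\; \sum_i p_i\, S(\rho_i),
  \qquad
  p_i = \operatorname{tr}\!\bigl(M_i \rho M_i^{\dagger}\bigr),
  \quad
  \rho_i = \frac{M_i \rho M_i^{\dagger}}{p_i},
\]
where $S(\rho) = -\operatorname{tr}(\rho \log \rho)$ is the von Neumann entropy.
```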
Malaria is a life-threatening mosquito-borne disease. Recently, the number of malaria cases has increased worldwide, threatening vulnerable populations. Malaria is responsible for high rates of morbidity and mortality around the world; according to the World Health Organization (WHO), many people die from the disease each year. Thick and thin blood smears are used to determine parasite habitation, and computer-aided diagnosis (CADx) techniques based on machine learning (ML) are being used to assist. CADx reduces traditional diagnosis time, lessens the socio-economic impact, and improves quality of life. This study develops a simplified model with selective features to reduce processing power and further shorten diagnostic time, which is important in resource-constrained areas. To improve overall classification results, we use a decision tree (DT)-based approach with image pre-processing to identify optimal features. Various feature selection and extraction techniques are used, including information gain (IG). Our proposed model is compared with a benchmark state-of-the-art classification model. On an unseen dataset, our proposed model achieves accuracy, precision, recall, F-score, and processing time of 0.956, 0.949, 0.964, 0.956, and 9.877 s, respectively. Furthermore, our proposed model's training time is less than that of the state-of-the-art classification model, while the performance metrics are comparable.
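As an illustration of combining an IG-style feature filter with a decision tree and reporting those metrics, here is a sketch over synthetic placeholder features (sklearn's mutual_info_classif stands in for the paper's IG step; nothing here reproduces the reported numbers):

```python
# Sketch: IG-filtered decision tree pipeline with the metrics named above.
# The features are synthetic placeholders for image-derived smear features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

X, y = make_classification(n_samples=800, n_features=40, n_informative=10,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

model = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=12)),  # IG-style filter
    ("tree", DecisionTreeClassifier(max_depth=6, random_state=1)),
])
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

for name, fn in [("accuracy", accuracy_score), ("precision", precision_score),
                 ("recall", recall_score), ("F-score", f1_score)]:
    print(name, round(fn(y_te, pred), 3))
```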
Decision tree-based algorithms are the fundamental step in applying the decision tree method, a predictive modeling technique for the classification of data. This chapter provides a broad overview of decision tree-based algorithms, which are among the most commonly used methods for constructing classifiers.
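For reference, the split criterion at the heart of ID3-style decision tree algorithms is information gain, the entropy reduction achieved by splitting the sample set on an attribute:

```latex
% Entropy-based split criterion used by ID3-style decision tree algorithms.
\[
  \operatorname{IG}(S, A) \;=\; H(S) \;-\;
  \sum_{v \in \operatorname{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v),
  \qquad
  H(S) = -\sum_{c} p_c \log_2 p_c ,
\]
where $S_v$ is the subset of $S$ with $A = v$ and $p_c$ is the fraction of
examples in $S$ belonging to class $c$.
```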