Linking named entities to structured knowledge sources paves the way for state-of-the-art Web intelligence applications that assign sentiment to the correct entities, identify trends, and reveal relations between organizations, persons, and products. To this end, this paper introduces Recognyze, a named entity linking component that uses background knowledge obtained from linked data repositories, and outlines the process of transforming heterogeneous data silos within an organization into a linked enterprise data repository which draws upon popular linked open data vocabularies to foster interoperability with public data sets. The presented examples use comprehensive real-world data sets from Orell Füssli Business Information, Switzerland's largest business information provider. The linked data repository created from these data sets comprises more than nine million triples on companies, their contact information, key people, products, and brands. We identify the major challenges of tapping into such sources for named entity linking and describe the data pre-processing techniques required to use and integrate such data sets, with a special focus on disambiguation and ranking algorithms. Finally, we conduct a comprehensive evaluation based on business news from the Neue Zürcher Zeitung and AWP Financial News to illustrate how these techniques improve the performance of the Recognyze named entity linking component.
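To illustrate the kind of candidate generation and ranking involved in linking a company mention against a knowledge base, the following minimal Python sketch matches a mention against labels and aliases and ranks candidates by a weighted mix of string similarity and popularity. The toy knowledge base, field names, thresholds, and weights are illustrative assumptions, not Recognyze's actual implementation or data.

```python
# Hedged sketch of name-based candidate generation and ranking for company
# mentions; KB entries, fields, and scoring weights are assumed for illustration.
from difflib import SequenceMatcher

# Toy stand-in for a linked enterprise data repository of companies.
KNOWLEDGE_BASE = [
    {"uri": "http://example.org/company/1", "label": "Acme Holding AG",
     "aliases": ["Acme", "Acme Holding"], "popularity": 0.9},
    {"uri": "http://example.org/company/2", "label": "Acme Logistics GmbH",
     "aliases": ["Acme Logistics"], "popularity": 0.4},
]

def candidates(mention):
    """Return KB entries whose label or alias roughly matches the mention."""
    found = []
    for entry in KNOWLEDGE_BASE:
        names = [entry["label"]] + entry["aliases"]
        best = max(SequenceMatcher(None, mention.lower(), n.lower()).ratio()
                   for n in names)
        if best > 0.6:
            found.append((best, entry))
    return found

def link(mention):
    """Rank candidates by combining string similarity and popularity."""
    ranked = sorted(candidates(mention),
                    key=lambda c: 0.7 * c[0] + 0.3 * c[1]["popularity"],
                    reverse=True)
    return ranked[0][1]["uri"] if ranked else None

print(link("Acme Holding"))  # -> http://example.org/company/1
```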
This paper first performed exploratory data visualization on the data set to determine the nature and representation of the input data and to conduct a preliminary feature selection. It then performed data preprocessing and feature engineering, which are of critical importance to the accuracy of the prediction results, and built multiple regression models to predict the missing values in the test set. Various data mining algorithms were implemented to build predictive models, including the Gaussian Naive Bayes classifier, K-Nearest Neighbors (K-NN), Multi-layer Perceptron (MLP), logistic regression, random forest, and XGBoost. In the experiments, the XGBoost classifier gave the best result among all the models.
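The core of the described workflow, regression-based imputation of missing values followed by an XGBoost classifier, could look roughly like the sketch below. The synthetic data, column names, train/test split, and hyperparameters are assumptions for illustration only.

```python
# Hedged sketch: regression-based imputation of missing values, then XGBoost.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Toy data standing in for the paper's data set.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=[f"f{i}" for i in range(5)])
y = (X["f0"] + X["f1"] > 0).astype(int)
X.iloc[rng.choice(500, 50), 2] = np.nan          # inject missing values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predict missing values from the other features with regression models.
imputer = IterativeImputer(random_state=0)
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

model = XGBClassifier(n_estimators=200, eval_metric="logloss")
model.fit(X_train_imp, y_train)
print("test accuracy:", model.score(X_test_imp, y_test))
```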
Diabetes is a chronic disease characterized by elevated blood glucose levels. According to the World Health Organization (WHO), 422 million people were diabetic as of 2014. This paper develops an accurate machine learning classification model and an efficient data pre-processing pipeline to improve overall accuracy. For this purpose, six algorithms are used for classification and their accuracies are compared: Support Vector Machine with a linear kernel (Linear-SVM), Support Vector Machine with an RBF kernel (RBF-SVM), K-Nearest Neighbor (KNN), Artificial Neural Network (ANN), Decision Tree, and Random Forest. Data imputation, oversampling, and feature scaling form the data preprocessing pipeline. Experiments are performed on a well-known dataset from the National Institute of Diabetes and Digestive and Kidney Diseases, the PIMA diabetes dataset. The data preprocessing techniques, data imputation and Synthetic Minority Oversampling Technique (SMOTE) analysis, improved classification accuracy from 77% on raw data to 88.12% (Random Forest classifier) and 91% (ANN classifier), respectively. Furthermore, a new feature generation approach is applied and its performance is analyzed using the SVM model: the original attributes BMI and Insulin are replaced with new features BMI_NORMAL and INSULIN_NORMAL, respectively. The significant improvement achieved by the proposed technique is confirmed by statistical testing followed by post-hoc analysis.
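A minimal sketch of the described preprocessing pipeline on the PIMA data set follows: median imputation of the physiologically impossible zero values, SMOTE oversampling, feature scaling, a Random Forest classifier, and the derived binary features BMI_NORMAL and INSULIN_NORMAL. The thresholds, file name, and hyperparameters are assumptions; the paper's exact settings may differ.

```python
# Hedged sketch of imputation + SMOTE + scaling + Random Forest on PIMA data,
# with assumed thresholds for the derived BMI_NORMAL / INSULIN_NORMAL features.
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diabetes.csv")          # PIMA diabetes data set (assumed file name)

# Treat zeros in selected columns as missing and impute with the median.
for col in ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]:
    s = df[col].replace(0, np.nan)
    df[col] = s.fillna(s.median())

# Derived binary features replacing the raw attributes (assumed normal ranges).
df["BMI_NORMAL"] = df["BMI"].between(18.5, 25).astype(int)
df["INSULIN_NORMAL"] = df["Insulin"].between(16, 166).astype(int)
X = df.drop(columns=["Outcome", "BMI", "Insulin"])
y = df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_train, y_train = SMOTE(random_state=0).fit_resample(X_train, y_train)

scaler = StandardScaler().fit(X_train)
clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(scaler.transform(X_train), y_train)
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))
```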
Big Data is a popular cutting-edge technology, and its techniques and algorithms are expanding into different areas including engineering, biomedicine, and business. Due to the high volume and complexity of Big Data, data pre-processing is necessary before data mining. The pre-processing methods include data cleaning, data integration, data reduction, and data transformation. Data clustering is the most important step of data reduction: with data clustering, mining on the reduced data set should be more efficient while still producing quality analytical results. This paper presents the different data clustering methods and related algorithms for data mining with Big Data. Data clustering can increase both the efficiency and the accuracy of data mining.
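As a concrete illustration of clustering as a data-reduction step, the sketch below uses k-means to replace a large set of records with size-weighted cluster centroids that can then be mined more efficiently. The data, cluster count, and weighting scheme are illustrative assumptions, not tied to any specific Big Data system.

```python
# Hedged sketch: k-means clustering as data reduction before mining.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(20_000, 8))       # toy stand-in for a large data set

kmeans = KMeans(n_clusters=200, n_init=10, random_state=0).fit(data)
reduced = kmeans.cluster_centers_                         # 200 representatives
weights = np.bincount(kmeans.labels_, minlength=200)      # records per centroid

print(f"reduced {data.shape[0]} records to {reduced.shape[0]} weighted representatives")
```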
Convolutional Neural Networks (CNNs) have made incredible progress in numerous research areas. However, the exponential growth of digital images causes over-burdening due to irrelevant features, heavy redundancy, and noisy data, which affects both the processing speed of the CNN and its classification accuracy. In this study, a novel reduction algorithm based on rough set theory, with no information loss, is proposed as a data pre-processor for CNNs. The proposed algorithm reduces the data through feature reduction and noisy-sample reduction: the rough set identifies the noisy, mislabeled boundary samples to be removed based on KNN rules. Experiments demonstrate that the proposed approach can increase the overall performance of convolutional neural networks.
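The KNN-based noisy-sample filter can be pictured with the sketch below, which removes samples whose label disagrees with the majority of their k nearest neighbours before training. This is a generic edited-nearest-neighbours style filter, not the paper's rough-set formulation; k, the toy data, and the noise injection are assumptions for illustration.

```python
# Hedged sketch of a KNN-rule filter for mislabeled boundary samples.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_noisy_samples(X, y, k=5):
    """Drop samples whose label differs from the majority label of their k neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                 # first neighbour is the sample itself
    neighbour_labels = y[idx[:, 1:]]
    keep = (neighbour_labels == y[:, None]).mean(axis=1) >= 0.5
    return X[keep], y[keep]

# Toy example with a few mislabeled points near the class boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)
y[rng.choice(1000, 30, replace=False)] ^= 1   # flip some labels to simulate noise

X_clean, y_clean = remove_noisy_samples(X, y)
print(f"kept {len(y_clean)} of {len(y)} samples")
```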