In this study, we compare the performance of four different imputation strategies, ranging from the commonly used Listwise Deletion to model-based approaches such as Maximum Likelihood, for enhancing the completeness of incomplete software project data sets. We evaluate the impact of each method by applying it to six real-time software project data sets, which are classified into different categories based on their inherent properties. The reliability of the data sets constructed with these techniques is further tested by building prediction models using stepwise regression. The experimental results are reported and the findings are discussed.
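A minimal, illustrative sketch of this kind of comparison (not the authors' pipeline): listwise deletion is contrasted with a model-based imputer on a hypothetical effort-estimation data set, and each completed data set is scored with an ordinary linear regression in place of stepwise regression.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["loc", "team", "complexity"])
y = 2 * X["loc"] + X["complexity"] + rng.normal(scale=0.5, size=200)
X_missing = X.mask(rng.random(X.shape) < 0.2)           # ~20% of values missing at random

# Strategy 1: listwise deletion (drop any row with a missing value).
complete = X_missing.dropna()
score_del = cross_val_score(LinearRegression(), complete, y[complete.index], cv=5).mean()

# Strategy 2: model-based imputation (iterative imputer as a stand-in for ML-style estimation).
X_imp = IterativeImputer(random_state=0).fit_transform(X_missing)
score_imp = cross_val_score(LinearRegression(), X_imp, y, cv=5).mean()

print(f"listwise deletion R^2: {score_del:.3f}  imputation R^2: {score_imp:.3f}")
```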
This paper presents an integrated approach that combines chemical reaction optimization (CRO) and functional link artificial neural networks (FLANNs) for building a classifier from datasets with missing values, inconsistent records, and noisy instances. Imputation is carried out based on the known values of the two nearest neighbors to address datasets plagued with missing values. A probabilistic approach is used to remove inconsistency from either the original or the imputed dataset. The resulting dataset is then given as input to a boosted instance selection approach that selects relevant instances, reducing the size of the dataset without loss of generality or compromised classification accuracy. Finally, the transformed dataset (i.e., from a non-imputed, inconsistent dataset to an imputed, consistent dataset) is used to develop a classifier based on a CRO-trained FLANN. The method is evaluated extensively on several benchmark datasets obtained from the University of California, Irvine (UCI) repository. The experimental results confirm that our preprocessing tasks, along with the integrated approach, can be a promising alternative tool for mitigating missing values, inconsistent records, and noisy instances.
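A small sketch of the nearest-neighbour imputation step only, assuming it can be approximated with scikit-learn's KNNImputer set to two neighbours; the CRO-trained FLANN classifier and the other preprocessing stages are outside the scope of this snippet.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [1.1, 1.9, 3.0],
              [0.9, 2.1, 2.8],
              [5.0, 6.0, 7.0]])

imputer = KNNImputer(n_neighbors=2)        # fill each gap from the two closest rows
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```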
Nuclear safeguards evaluation aims to verify that countries are not misusing nuclear programs for nuclear weapons purposes. Experts of the International Atomic Energy Agency (IAEA) carry out an evaluation process in which several hundred indicators are assessed according to information obtained from different sources, such as State declarations, on-site inspections, IAEA non-safeguards databases and other open sources. These assessments are synthesized in a hierarchical way to obtain a global assessment. Much of the information and many of the sources related to nuclear safeguards are vague, imprecise and ill-defined. The fuzzy linguistic approach has provided good results in dealing with such uncertainties in this type of problem. However, a new challenge in nuclear safeguards evaluation has attracted the attention of researchers. Due to the complexity and vagueness of the sources of information obtained by IAEA experts and the huge number of indicators involved, it is common that experts cannot assess all of them, which introduces missing values into the evaluation and can bias the nuclear safeguards results. This paper proposes a model based on collaborative filtering (CF) techniques to impute missing values and provides a trust measure that indicates the reliability of the nuclear safeguards evaluation with the imputed values.
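A toy sketch of collaborative-filtering-style imputation, assuming indicator assessments can be encoded numerically in an experts-by-indicators matrix; the fuzzy linguistic layer and the trust measure proposed in the paper are not reproduced here.

```python
import numpy as np

# Hypothetical assessments: rows are experts, columns are indicators, NaN = not assessed.
R = np.array([[4., 3., np.nan, 5.],
              [4., np.nan, 4., 5.],
              [3., 3., 4., 4.],
              [np.nan, 2., 3., 4.]])

def cosine_sim(a, b):
    mask = ~np.isnan(a) & ~np.isnan(b)          # compare only co-assessed indicators
    if mask.sum() == 0:
        return 0.0
    return float(a[mask] @ b[mask] / (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask]) + 1e-12))

filled = R.copy()
for i, j in zip(*np.where(np.isnan(R))):
    sims = np.array([cosine_sim(R[i], R[k]) if k != i and not np.isnan(R[k, j]) else 0.0
                     for k in range(R.shape[0])])
    if sims.sum() > 0:
        filled[i, j] = sims @ np.nan_to_num(R[:, j]) / sims.sum()   # similarity-weighted mean
print(filled)
```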
Consider cooperative coalition games with side payments. Bargaining sets are calculated for all possible coalition structures to obtain a collection of imputations rather than a single imputation. Our aim is to obtain a single payoff vector that is acceptable to all players of the game under the grand coalition. Though the Shapley value is a single imputation, it is based on fair division rather than bargaining considerations. We therefore present a method to obtain a single imputation based on bargaining considerations.
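For contrast with the bargaining-based approach described here, the following sketch computes the Shapley value, itself a single imputation of v(N), for a hypothetical three-player characteristic function v; the paper's own bargaining-set method is not shown.

```python
from itertools import permutations

players = [1, 2, 3]
v = {frozenset(): 0, frozenset({1}): 10, frozenset({2}): 20, frozenset({3}): 30,
     frozenset({1, 2}): 40, frozenset({1, 3}): 50, frozenset({2, 3}): 60,
     frozenset({1, 2, 3}): 90}

# Average each player's marginal contribution over all orderings of arrival.
shapley = {p: 0.0 for p in players}
orders = list(permutations(players))
for order in orders:
    coalition = set()
    for p in order:
        shapley[p] += v[frozenset(coalition | {p})] - v[frozenset(coalition)]
        coalition.add(p)
shapley = {p: val / len(orders) for p, val in shapley.items()}
print(shapley)   # the resulting payoff vector sums to v(N) = 90
```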
With the rapid explosion of data streams from applications, ensuring accurate data analysis is essential for effective real-time decision making. Data stream applications often confront missing values, which affect the performance of classification models. Several imputation models have adopted deep learning algorithms for estimating missing values; however, the lack of parameter and structure tuning in classification degrades their imputation performance. This work presents a missing data imputation model using an adaptive deep incremental learning algorithm for streaming applications. The proposed approach incorporates two main processes: enhancing the deep incremental learning algorithm and enhancing deep incremental learning-based imputation. First, the approach tunes the learning rate with both the Adaptive Moment Estimation (Adam) and Stochastic Gradient Descent (SGD) optimizers and tunes the number of hidden neurons. Second, it applies the enhanced deep incremental learning algorithm to estimate the imputed values in two steps: (i) an imputation process that predicts the missing values based on temporal proximity, and (ii) generation of a complete IoT dataset by filling the missing entries with the predicted values. The experimental outcomes illustrate that the proposed imputation model effectively transforms an incomplete dataset into a complete dataset with minimal error.
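A rough sketch under heavy simplification: an incrementally trained MLP (scikit-learn's MLPRegressor with partial_fit, whose solver can be switched between 'adam' and 'sgd') predicts a missing sensor reading from its temporal neighbours. The adaptive tuning of learning rate and hidden neurons described above is reduced here to fixed, hand-picked settings.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
stream = np.sin(np.linspace(0, 20, 1000)) + rng.normal(scale=0.05, size=1000)

model = MLPRegressor(hidden_layer_sizes=(16,), solver="adam",
                     learning_rate_init=1e-3, random_state=0)

# Incremental training on mini-batches: features are the two temporal neighbours.
X = np.column_stack([stream[:-2], stream[2:]])     # values at t-1 and t+1
y = stream[1:-1]                                   # value at t
for start in range(0, len(y), 100):                # simulate arrival in batches of 100
    model.partial_fit(X[start:start + 100], y[start:start + 100])

# Impute a "missing" reading at an arbitrary position t from its neighbours.
t = 500
estimate = model.predict([[stream[t - 1], stream[t + 1]]])[0]
print(f"true {stream[t]:.3f}  imputed {estimate:.3f}")
```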
Missing values in time series data are a well-known and important problem that researchers have studied extensively in various fields. In this paper, a new nonparametric approach for missing value imputation in time series is proposed. The main novelty of this research is applying the L1 norm-based version of Singular Spectrum Analysis (SSA), namely L1-SSA, which is robust against outliers. The performance of the new imputation method is compared with many other established methods by applying them to various real and simulated time series. The results confirm that the SSA-based methods, especially L1-SSA, provide better imputation than the other methods.
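A compact sketch of iterative SSA-based imputation using the ordinary (L2, SVD-based) decomposition; the paper's L1-SSA replaces the SVD step with an L1-norm variant, which is not reproduced here.

```python
import numpy as np

def ssa_reconstruct(x, window, rank):
    n = len(x)
    k = n - window + 1
    traj = np.column_stack([x[i:i + window] for i in range(k)])   # Hankel trajectory matrix
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank]                 # low-rank approximation
    # Diagonal (Hankel) averaging back to a one-dimensional series.
    recon, counts = np.zeros(n), np.zeros(n)
    for j in range(k):
        recon[j:j + window] += approx[:, j]
        counts[j:j + window] += 1
    return recon / counts

rng = np.random.default_rng(2)
x = np.sin(np.linspace(0, 12, 300)) + 0.1 * rng.normal(size=300)
missing = rng.random(300) < 0.1
x_obs = x.copy(); x_obs[missing] = np.nan

filled = np.where(missing, np.nanmean(x_obs), x_obs)              # crude initial fill
for _ in range(20):                                               # iterate: reconstruct, re-fill
    recon = ssa_reconstruct(filled, window=40, rank=2)
    filled[missing] = recon[missing]
print(np.abs(filled[missing] - x[missing]).mean())                # mean absolute error on gaps
```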
Two well-known matrix factorization techniques, Singular Value Decomposition (SVD) and Nonnegative Matrix Factorization (NMF), are widely used in recommender system applications. Recommender system data matrices have many missing entries, and to make them suitable for factorization, the missing entries need to be filled. For matrix completion, we use the mean, median, and mode as three different imputation cases. The natural clusters produced after factorization are used to formulate simple out-of-sample extension algorithms and methods for generating recommendations for a new user. Two cluster evaluation measures, Normalized Mutual Information (NMI) and Purity, are used to evaluate the quality of the clusters.
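A short sketch under the assumption that mean filling of a synthetic ratings matrix is followed by scikit-learn's NMF; cluster labels are read off as the dominant latent factor and scored with NMI. The specific recommender data and the out-of-sample extension of the paper are not shown.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(3)
true_groups = np.repeat([0, 1], 50)                      # two hypothetical user groups
R = np.where(true_groups[:, None] == 0,
             rng.integers(4, 6, size=(100, 20)),         # group 0 rates items highly
             rng.integers(1, 3, size=(100, 20))).astype(float)
R[rng.random(R.shape) < 0.3] = np.nan                    # 30% missing entries

col_means = np.nanmean(R, axis=0)                        # mean imputation for completion
R_filled = np.where(np.isnan(R), col_means, R)

W = NMF(n_components=2, init="nndsvda", random_state=0).fit_transform(R_filled)
clusters = W.argmax(axis=1)                              # natural clusters from the factors
print("NMI:", normalized_mutual_info_score(true_groups, clusters))
```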
Missing data remain a common issue in real-world environments and lead to deviations in data analysis and mining. Therefore, to lessen the consequences of missing data caused by human error, missing data imputation must be used in data processing. Traditional imputation models fail to satisfy evaluation requirements because of their poor stability and low accuracy, and their imputation accuracy degrades as the amount of missing information increases. Hence, in this research, an optimized missing data imputation model is proposed using a Socio-hawk optimization Deep Neural Network (DNN). The DNN extracts the important features from the data, in which the missing data are estimated under an arbitrary missing pattern. When the hyperparameters are tuned properly, the DNN's performance improves. The key here is the efficient training of the DNN using the suggested Socio-hawk optimization, which improves the imputation model's accuracy. To determine how well the suggested model imputes missing data, it is compared to other methods. The paper's primary contribution is therefore the effective training of the DNN using the suggested Socio-hawk optimization, which reduces the error rate of the imputation model. The experimental evaluation shows that the proposed missing data imputation model attains a high performance at 90%, yielding an MAE of 1.0595, an MSE of 1.9919, and a MAPE of 0.9421.
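A minimal sketch of the three error measures reported above (MAE, MSE, MAPE), computed between hypothetical true and imputed values; the Socio-hawk optimized DNN itself is not reproduced here.

```python
import numpy as np

y_true = np.array([3.2, 4.8, 5.1, 2.9, 6.0])      # hypothetical ground-truth values
y_imputed = np.array([3.0, 5.0, 4.7, 3.1, 6.4])   # hypothetical imputed values

mae = np.mean(np.abs(y_true - y_imputed))
mse = np.mean((y_true - y_imputed) ** 2)
mape = np.mean(np.abs((y_true - y_imputed) / y_true)) * 100   # percentage error

print(f"MAE={mae:.4f}  MSE={mse:.4f}  MAPE={mape:.2f}%")
```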
Prostate Specific Antigen (PSA) level in the serum is one of the most widely used markers in monitoring prostate cancer (PCa) progression, treatment response, and disease relapse. Although significant efforts have been made to analyze the various socioeconomic and cultural factors that contribute to racial disparities in PCa, limited research has been performed to quantitatively understand how and to what extent molecular alterations may impact the differential PSA levels present at varied tumor status between African-American and European-American men. Moreover, missing values among patients add another layer of difficulty in precisely inferring their outcomes. In light of these issues, we propose a data-driven, deep learning-based imputation and inference framework (DIIF). DIIF seamlessly encapsulates two modules: an imputation module driven by a regularized deep autoencoder for imputing critical missing information and an inference module in which two deep variational autoencoders are coupled with a graphical inference model to quantify the personalized and race-specific causal effects. Large-scale empirical studies on independent sub-cohorts of The Cancer Genome Atlas (TCGA) PCa patients demonstrate the effectiveness of DIIF. We further found that somatic mutations in TP53, ATM, PTEN, FOXA1, and PIK3CA are statistically significant genomic factors that may explain the racial disparities in different PCa features characterized by PSA.
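A loose sketch of the imputation idea only, assuming a regularized autoencoder-style model can be mimicked by an MLP trained to reconstruct complete records (with L2 regularization via alpha); the variational and causal-inference modules of DIIF are well beyond this snippet, and all features are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 8))                            # hypothetical molecular features
X[:, 0] = 0.5 * X[:, 1] + 0.5 * X[:, 2]                  # a correlated feature to recover

complete = X[:400]
incomplete = X[400:].copy()
incomplete[:, 0] = np.nan                                # feature 0 missing for new patients

# "Autoencoder": reconstruct all features from a corrupted copy with feature 0 zeroed out.
corrupted = complete.copy(); corrupted[:, 0] = 0.0
ae = MLPRegressor(hidden_layer_sizes=(4,), alpha=1e-3, max_iter=2000,
                  random_state=0).fit(corrupted, complete)

probe = np.nan_to_num(incomplete)                        # zero out the missing feature
imputed = ae.predict(probe)[:, 0]                        # read back the reconstructed feature
print(np.abs(imputed - X[400:, 0]).mean())               # reconstruction error on the held-out feature
```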
Single-cell RNA sequencing (scRNA-seq) has proven to be an effective technology for investigating cellular heterogeneity and transcriptome dynamics at single-cell resolution. However, one of the major problems with data obtained by scRNA-seq is the excess of zeros in the count matrix, which greatly hinders downstream analysis. Here, we present a method that integrates non-negative matrix factorization and transfer learning (NMFTL) to impute scRNA-seq data. It borrows gene expression information from an additional dataset and adds graph-regularized terms to the decomposed matrices. These strategies not only maintain the intrinsic geometrical structure of the data itself but also further improve the accuracy of estimating expression values through the transfer term in the model. Analysis of real data demonstrates that the proposed method outperforms existing matrix-factorization-based imputation methods in recovering dropout entries and preserving gene-to-gene and cell-to-cell relationships, and it also performs well in downstream analyses such as cell clustering. For convenience, we have implemented the NMFTL method in R scripts, available at https://github.com/FocusPaka/NMFTL.
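A bare-bones sketch of matrix-factorization imputation for dropouts: a synthetic count matrix is factorized with plain NMF and zero entries are replaced by the reconstructed values. The graph-regularized and transfer-learning terms that define NMFTL are omitted.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(5)
true_counts = rng.poisson(lam=5.0, size=(200, 50)).astype(float)   # cells x genes
dropout = rng.random(true_counts.shape) < 0.3
observed = np.where(dropout, 0.0, true_counts)                     # dropouts appear as zeros

model = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(observed)
reconstructed = W @ model.components_

imputed = np.where(observed == 0, reconstructed, observed)         # only fill the zero entries
print(np.abs(imputed[dropout] - true_counts[dropout]).mean())      # recovery error on dropouts
```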
Missing data are encountered in many research studies, including well-conducted and controlled ones. Missing data can reduce the statistical power of a study and may produce biased estimates, leading to invalid conclusions. This study focuses on the problems and types of missing data, together with techniques for handling them. The mechanisms by which missing data arise and the methods for studying such data are illustrated. We treat multiple imputation as a very efficient method for imputing missing data and apply it to simulated cases and to real time series data. We have also prepared and adapted scripts in the programming language R to conduct the simulations. The mice and Amelia packages for imputing missing values provide fairly good approximations even in the case of real data.
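The study works in R with mice and Amelia; as a language-neutral illustration of the same multiple-imputation idea, the sketch below generates several imputed data sets with scikit-learn's IterativeImputer (a chained-equations imputer in the MICE style) and pools the resulting regression estimates.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=300)
X_miss = X.copy(); X_miss[rng.random(X.shape) < 0.2] = np.nan

coefs = []
for m in range(5):                                        # m imputed data sets
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    X_m = imp.fit_transform(X_miss)
    coefs.append(LinearRegression().fit(X_m, y).coef_)

# Average the point estimates across the imputed data sets.
print("pooled coefficients:", np.mean(coefs, axis=0))
```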
The growth of publicly available repositories, such as the Gene Expression Omnibus, has allowed researchers to conduct meta-analysis of gene expression data across distinct cohorts. In this work, we assess eight imputation methods for their ability to impute gene expression data when values are missing across an entire cohort of Tuberculosis (TB) patients. We investigate how varying proportions of missing data (across 10%, 20%, and 30% of patient samples) influence the imputation results, and test for significantly differentially expressed genes and enriched pathways in patients with active TB. Our results indicate that truncating to common genes observed across cohorts, which is the current method used by researchers, results in the exclusion of important biology and suggest that LASSO and LLS imputation methodologies can reasonably impute genes across cohorts when total missingness rates are below 20%.
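A simplified sketch of LASSO-based imputation of a gene missing across an entire cohort: a Lasso model learns the target gene from shared genes in cohorts where it was measured, then predicts it for the cohort where it is absent. The genes and data here are hypothetical, and the LLS alternative is not shown.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
shared = rng.normal(size=(150, 30))                        # expression of genes shared by all cohorts
target = shared[:, :3] @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.2, size=150)

train = slice(0, 100)                                      # cohorts where the gene was measured
missing_cohort = slice(100, 150)                           # cohort where the gene is entirely absent

model = Lasso(alpha=0.05).fit(shared[train], target[train])
imputed = model.predict(shared[missing_cohort])
print(np.corrcoef(imputed, target[missing_cohort])[0, 1])  # agreement with the held-out truth
```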
The incompleteness of race and ethnicity information in real-world data (RWD) hampers its utility in promoting healthcare equity. This study introduces two methods—one heuristic and the other machine learning-based—to impute race and ethnicity from genetic ancestry using tumor profiling data. Analyzing de-identified data from over 100,000 cancer patients sequenced with the Tempus xT panel, we demonstrate that both methods outperform existing geolocation and surname-based methods, with the machine learning approach achieving high recall (range: 0.859-0.993) and precision (range: 0.932-0.981) across four mutually exclusive race and ethnicity categories. This work presents a novel pathway to enhance RWD utility in studying racial disparities in healthcare.
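An illustrative sketch only: a multiclass classifier maps hypothetical genetic-ancestry fractions to one of four race and ethnicity categories and is scored with per-class precision and recall, mirroring the metrics reported above. It is not the Tempus model or data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
n = 1000
labels = rng.integers(0, 4, size=n)                        # four mutually exclusive categories
centers = np.eye(4)                                        # each class dominated by one ancestry component
ancestry = centers[labels] + rng.normal(scale=0.15, size=(n, 4))

X_tr, X_te, y_tr, y_te = train_test_split(ancestry, labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))      # per-class precision and recall
```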
Missing values are an often-cited encumbrance to effective data analysis. Whether or not their presence can be explained may be the issue; at the very least it should be acknowledged. This study discusses the extant issues surrounding the presence of missing values in data analysis, with particular attention to their management, including imputation. Following this discussion, the nascent Classification and Ranking Belief Simplex (CaRBS) system for data analysis (object classification) is presented, which has the distinction of not requiring any a priori consideration (management) of the missing values present. Instead, missing values are treated as ignorant values and retained in the analysis, a facet of CaRBS associated with the notion of uncertain reasoning. A problem on the classification of standard and economy food products is considered, with knowledge of their inherent nutrient levels used in their discernment. The visualisation of the intermediate and final results offered by the CaRBS system clearly demonstrates the effects of the presence of missing values within an object classification context.
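A toy sketch of the underlying idea of treating a missing value as ignorance in evidence-based (Dempster-Shafer style) classification: a missing nutrient reading contributes all of its mass to the whole frame of discernment rather than being imputed. This illustrates the notion only; it is not the CaRBS algorithm itself, and the evidence values are invented.

```python
# Mass functions over the frame {standard, economy}; keys are focal elements.
FRAME = frozenset({"standard", "economy"})

def combine(m1, m2):
    """Dempster's rule of combination for two mass functions."""
    combined, conflict = {}, 0.0
    for a, w1 in m1.items():
        for b, w2 in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + w1 * w2
            else:
                conflict += w1 * w2
    return {k: v / (1 - conflict) for k, v in combined.items()}

evidence_salt = {frozenset({"economy"}): 0.6, FRAME: 0.4}   # an observed nutrient level
evidence_fat = {FRAME: 1.0}                                 # a missing value: total ignorance

print(combine(evidence_salt, evidence_fat))                 # missing evidence leaves belief unchanged
```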
Our main focus in this and the next chapter is the application of Cooperative Game Theory (CGT) models to international water resource issues. In this chapter we will justify the use of CGT in water resource problems, and in particular, in international conflict-cooperation cases. The chapter reviews several important CGT concepts and demonstrates their use and calculation. After reading this chapter you will have a good grasp of basic CGT concepts and be able to apply them at both conceptual and empirical levels to simple cases.