Price forecasts for major metal commodities have long been important to market participants. To tackle the problem, this study examines daily copper prices. The sample under investigation spans more than ten years, from 01/02/2014 to 04/12/2024, and the price series under examination has substantial financial implications. Gaussian process regression models are built using cross-validation techniques and Bayesian optimization, and the resulting models are used to produce price estimates. A relative root-mean-square error of 1.3880% indicates that our empirical approach produces reasonably accurate price estimates over the out-of-sample assessment period of 04/11/2022–04/12/2024. Such price prediction models give investors and governments the information they need to make sound decisions about the copper market.
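As a rough sketch of such a pipeline, the following uses scikit-learn's Gaussian process regressor on synthetic stand-in prices, with cross-validated grid search substituting for the paper's Bayesian optimization; the lag features and the relative-RMSE definition are assumptions, not the paper's specification.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Hypothetical data: lagged daily prices as features, next-day price as target.
rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(0, 1, 500)) + 100.0            # stand-in for copper prices
X = np.column_stack([prices[i:-(5 - i)] for i in range(5)])  # 5 lagged values per row
y = prices[5:]

kernel = ConstantKernel() * RBF() + WhiteKernel()
search = GridSearchCV(
    GaussianProcessRegressor(kernel=kernel, normalize_y=True),
    param_grid={"alpha": [1e-10, 1e-5, 1e-2]},   # jitter level chosen by CV
    cv=TimeSeriesSplit(n_splits=5),              # folds respect time ordering
)
search.fit(X[:-100], y[:-100])                   # hold out the last 100 days

pred = search.predict(X[-100:])
# One common definition of relative RMSE: RMSE divided by the mean actual value.
rrmse = np.sqrt(np.mean((pred - y[-100:]) ** 2)) / np.mean(y[-100:]) * 100
print(f"relative RMSE: {rrmse:.4f}%")
```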
Traditional training methods need to collect a large amount of data for every subject to train a subject-specific classifier, which causes subject fatigue and imposes a training burden. This study proposes a novel training method, TrAdaBoost based on cross-validation and an adaptive threshold (CV-T-TAB), to reduce the amount of data required for training by selecting and combining the classifiers of multiple subjects that perform well on a new subject. The method adopts cross-validation to extend the amount of the new subject's training data and sets an adaptive threshold to select the optimal combination of classifiers. Twenty-five subjects participated in N200- and P300-based brain–computer interface experiments. The study compares CV-T-TAB to five traditional training methods by testing them on the training of a support vector machine. Accuracy, information transfer rate, area under the curve, recall, and precision are used to evaluate performance under nine conditions with different amounts of data. CV-T-TAB outperforms the other methods and retains a high accuracy even when the amount of data is reduced to one-third of the original amount. The results imply that CV-T-TAB is effective in improving the performance of a subject-specific classifier with a small amount of data by adopting multiple subjects' classifiers, which reduces the training cost.
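The full CV-T-TAB procedure is more involved than an abstract can convey; as a rough sketch of the selection-and-combination idea only, the following assumes pre-trained scikit-learn-style source classifiers and a small labeled set from the new subject, with a mean-score threshold standing in for the paper's adaptive threshold.

```python
import numpy as np

def select_and_combine(source_clfs, X_new, y_new):
    """Score each source subject's pre-trained classifier on the new
    subject's small labeled set, keep those above an adaptive
    (data-driven) threshold, and combine them by majority vote."""
    scores = np.array([clf.score(X_new, y_new) for clf in source_clfs])
    threshold = scores.mean()                  # stand-in adaptive threshold
    selected = [c for c, s in zip(source_clfs, scores) if s >= threshold]

    def predict(X):
        votes = np.mean([c.predict(X) for c in selected], axis=0)
        return (votes >= 0.5).astype(int)      # binary labels assumed
    return predict
```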
In information processing systems for classification and regression tasks, global parameters are often introduced to balance the prior expectation about the processed data and the emphasis on reproducing the training data. Since over-emphasizing either of them leads to poor generalization, optimal global parameters are needed. Conventionally, a time-consuming cross-validation procedure is used. Here we introduce a novel approach to this problem, based on the Green's function. All estimations can be made empirically and hence can be easily extended to more complex systems. The method is fast since it does not require the validation step. Its performance on benchmark data sets is very satisfactory.
In a classification problem, quite often the dimension of the measurement vector is large. Some of these measurements may not be important for separating the classes. Removal of these measurement variables not only reduces the computational cost but also leads to a better understanding of class separability. There are some methods in the existing literature for reducing the dimensionality of a classification problem without losing much of the separability information. However, these dimension reduction procedures usually work well for linear classifiers. In the case where competing classes are not linearly separable, one has to look for ideal "features", which could be transformations of one or more measurements. In this paper, we attempt to tackle both problems, dimension reduction and feature extraction, by considering a projection pursuit regression model. The single hidden layer perceptron model and some other popular models can be viewed as special cases of this model. An iterative algorithm based on backfitting is proposed to select the features dynamically, and a cross-validation method is used to select the ideal number of features. We carry out an extensive simulation study to show the effectiveness of this fully automatic method.
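A crude illustrative sketch of backfitting for projection pursuit regression, with the number of terms chosen by cross-validation; the polynomial ridge functions and the least-squares direction update are simplifications, not the paper's algorithm.

```python
import numpy as np
from sklearn.model_selection import KFold

def fit_ppr(X, y, n_terms, sweeps=5, degree=3):
    """Crude projection pursuit regression via backfitting:
    y ~ sum_m g_m(X @ w_m), with polynomial ridge functions g_m."""
    n, p = X.shape
    W = np.zeros((n_terms, p))
    G = [np.zeros(degree + 1) for _ in range(n_terms)]
    fitted = np.zeros((n_terms, n))
    for _ in range(sweeps):
        for m in range(n_terms):
            r = y - fitted.sum(axis=0) + fitted[m]     # partial residual
            w, *_ = np.linalg.lstsq(X, r, rcond=None)  # crude direction update
            w /= np.linalg.norm(w) + 1e-12
            z = X @ w
            G[m] = np.polyfit(z, r, degree)            # ridge-function update
            W[m], fitted[m] = w, np.polyval(G[m], z)
    return W, G

def predict_ppr(W, G, X):
    return sum(np.polyval(g, X @ w) for w, g in zip(W, G))

def cv_select_n_terms(X, y, candidates=(1, 2, 3, 4), n_splits=5):
    """Pick the number of features (terms) by K-fold cross-validation."""
    errs = []
    for M in candidates:
        fold_err = []
        for tr, te in KFold(n_splits, shuffle=True, random_state=0).split(X):
            W, G = fit_ppr(X[tr], y[tr], M)
            fold_err.append(np.mean((predict_ppr(W, G, X[te]) - y[te]) ** 2))
        errs.append(np.mean(fold_err))
    return candidates[int(np.argmin(errs))]
```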
We define the problem of optimizing the architecture of a multilayer perceptron (MLP) as a state space search and propose the MOST (Multiple Operators using Statistical Tests) framework, which incrementally modifies the structure and checks for improvement using cross-validation. We consider five variants that implement forward/backward search, using single/multiple operators and searching depth-first/breadth-first. On 44 classification and 30 regression datasets, we exhaustively search for the optimal architecture and evaluate goodness based on: (1) Order, the accuracy with respect to the optimal, and (2) Rank, the computational complexity. We check for the effect of two resampling methods (5 × 2, ten-fold cv), four statistical tests (5 × 2 cv t, ten-fold cv t, Wilcoxon, sign) and two corrections for multiple comparisons (Bonferroni, Holm). We also compare with Dynamic Node Creation (DNC) and Cascade Correlation (CC). Our results show that: (1) on most datasets, networks with few hidden units are optimal, (2) forward searching finds simpler architectures, (3) variants using single node additions (deletions) generally stop early and get stuck in simple (complex) networks, (4) choosing the best of multiple operators finds networks closer to the optimal, and (5) MOST variants generally find simpler networks with error rates lower than or comparable to those of DNC and CC.
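The 5 × 2 cv paired t-test named above is Dietterich's test; the sketch below compares two hypothetical MLP sizes on a stand-in dataset with scikit-learn. The dataset, layer sizes, and decision logic are illustrative, not the paper's setup.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_breast_cancer

def five_by_two_cv_t(make_a, make_b, X, y):
    """Dietterich's 5x2cv paired t-test comparing two models."""
    diffs = []
    for rep in range(5):                         # 5 replications of 2-fold CV
        cv = StratifiedKFold(2, shuffle=True, random_state=rep)
        d = []
        for tr, te in cv.split(X, y):
            ea = 1 - make_a().fit(X[tr], y[tr]).score(X[te], y[te])
            eb = 1 - make_b().fit(X[tr], y[tr]).score(X[te], y[te])
            d.append(ea - eb)                    # paired error difference
        diffs.append(d)
    diffs = np.array(diffs)                      # shape (5, 2)
    mean = diffs.mean(axis=1, keepdims=True)
    s2 = ((diffs - mean) ** 2).sum(axis=1)       # per-replication variance
    t = diffs[0, 0] / np.sqrt(s2.mean())         # t statistic, 5 deg. freedom
    return t, 2 * stats.t.sf(abs(t), df=5)

X, y = load_breast_cancer(return_X_y=True)
small = lambda: MLPClassifier((4,), max_iter=2000, random_state=0)
large = lambda: MLPClassifier((32,), max_iter=2000, random_state=0)
print(five_by_two_cv_t(small, large, X, y))  # grow the net only if it helps
```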
The paper describes an integrated recognition-by-parts architecture for reliable and robust face recognition. Reliability and robustness refer, respectively, to the ability to deploy full-fledged and operational biometric engines, and to the handling of adverse image conditions that include, among others, uncooperative subjects, occlusion, and temporal variability. The architecture proposed is model-free and non-parametric. The conceptual framework draws support from discriminative methods using likelihood ratios. At the conceptual level it links forensics and biometrics, while at the implementation level it links the Bayesian framework and statistical learning theory (SLT). Layered categorization starts with face detection using implicit rather than explicit segmentation. It proceeds with face authentication, which involves feature selection of local patch instances including dimensionality reduction, exemplar-based clustering of patches into parts, and data fusion for matching using boosting driven by parts that play the role of weak learners. Face authentication shares the same implementation with face detection. The implementation, driven by transduction, employs proximity and typicality (ranking) realized using strangeness and p-values, respectively. The feasibility and reliability of the proposed architecture are illustrated using FRGC data. The paper concludes with suggestions for augmenting and enhancing the scope and utility of the proposed architecture.
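As a rough illustration of the transductive machinery (strangeness and p-values), here is one common k-nearest-neighbor instantiation from the transduction literature; the paper's exact strangeness measure over patch/part features may differ.

```python
import numpy as np

def strangeness(X, y, i, k=3):
    """Strangeness of sample i: summed distance to its k nearest
    same-class neighbors over that to its k nearest other-class
    neighbors. Larger values mean the sample looks atypical."""
    d = np.linalg.norm(X - X[i], axis=1)
    same = np.sort(d[(y == y[i]) & (np.arange(len(y)) != i)])[:k]
    other = np.sort(d[y != y[i]])[:k]
    return same.sum() / (other.sum() + 1e-12)

def p_value(alphas, alpha_new):
    """Typicality of a new sample: fraction of existing strangeness
    values at least as large as its own (higher = more typical)."""
    return (np.sum(alphas >= alpha_new) + 1) / (len(alphas) + 1)
```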
Leave-one-out (LOO) and its generalization, K-fold, are among the best-known cross-validation methods, which divide the sample into several folds, each of which is, in turn, left out for testing while the remaining parts are used for training. In this study, as an extension of this idea, we propose a new cross-validation approach, called miss-one-out (MOO), that mislabels the example(s) in each fold and keeps the fold in the training set, rather than leaving it out as LOO does. Then, MOO tests whether the trained classifier can correct the erroneous labels of the training samples. In principle, having only one fold deliberately labeled incorrectly should have only a small effect on a classifier that uses this bad fold along with the K - 1 good folds, and this can be utilized as a generalization measure of the classifier. Experimental results on a number of benchmark datasets and three real bioinformatics datasets show that MOO can better estimate the test set accuracy of the classifier.
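A minimal sketch of MOO as described above, assuming scikit-learn-style classifiers and integer NumPy labels; the flip-to-next-class rule is one simple choice for mislabeling.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def miss_one_out_score(clf, X, y, n_splits=10, seed=0):
    """MOO: mislabel one fold, keep it in the training set, and measure
    how often the trained classifier corrects the bad labels."""
    classes = np.unique(y)
    corrected = []
    for _, bad in StratifiedKFold(n_splits, shuffle=True,
                                  random_state=seed).split(X, y):
        y_bad = y.copy()
        # flip each label in the bad fold to a different class
        y_bad[bad] = classes[(np.searchsorted(classes, y[bad]) + 1) % len(classes)]
        clf.fit(X, y_bad)                   # train on all folds, one bad
        corrected.append(np.mean(clf.predict(X[bad]) == y[bad]))
    return float(np.mean(corrected))

# e.g. miss_one_out_score(SVC(), X, y) on any labeled (X, y) arrays
```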
It is well known that the performance of kernel methods depends on the choice of appropriate kernels and associated parameters. While cross-validation (CV) is a useful method of kernel and parameter choice for supervised learning such as support vector machines, there are no general well-founded methods for unsupervised kernel methods. This paper discusses CV for kernel canonical correlation analysis (KCCA) and proposes a new regularization approach for KCCA. As we demonstrate with Gaussian kernels, the CV errors for KCCA tend to decrease as the bandwidth parameter of the kernel decreases, which yields inappropriate features with all the data concentrated in a few points. This is caused by the ill-posedness of KCCA under CV. To solve this problem, we propose to use constraints on the fourth-order moments of canonical variables in addition to the variances. Experiments on synthesized and real-world data demonstrate that the proposed higher-order regularized KCCA can be applied effectively with CV to find appropriate kernel and regularization parameters.
Classification is an important field in machine learning and pattern recognition. Among the various types of classifiers, such as nearest neighbor, neural network, and Bayesian classifiers, the support vector machine (SVM) is known to be a very powerful classifier.
One of the advantages of SVM over other methods is its efficient and adjustable generalization capability. The performance of an SVM classifier depends on its parameters, especially the regularization parameter C, which is usually selected by cross-validation. Despite its strong generalization, SVM suffers from some limitations, such as its considerably slow training phase. Cross-validation is a very time-consuming part of training, because for every candidate value of the parameter C, the entire process of training and validation must be repeated.
In this paper, we propose a novel approach for early stopping of the SVM learning algorithm. The proposed early stopping integrates the validation part into the optimization part of SVM training without losing any generality or degrading the performance of the classifier. Moreover, this method can be used in conjunction with other available acceleration methods, since it has no dependency on them and introduces no redundancy. Our method was tested and verified on various UCI repository datasets, and the results indicate that it speeds up the learning phase of SVM without losing generality or affecting the final model of the classifier.
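For contrast, the conventional cross-validation loop that the proposed early stopping aims to accelerate looks like the following sketch, in which every candidate C triggers a full train/validate cycle per fold (the dataset and C grid are illustrative).

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Conventional selection of C: each candidate value is trained and
# validated from scratch on every fold -- the cost the paper targets.
search = GridSearchCV(SVC(kernel="rbf"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```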
Classification of communication signals is under increasing demand. In this paper, we present a new technique that identifies a variety of digital communication signal types. The technique uses a radial basis function neural network (RBFN) as the classifier, and swarm intelligence, as an evolutionary algorithm, is used to construct the RBFN. A combination of higher-order moments and higher-order cumulants up to order eight is selected as the features of the considered digital signal types. In conjunction with the RBFN, we use k-fold cross-validation to improve generalization ability. Simulation results show that the proposed technique classifies different communication signals with high performance even at very low signal-to-noise ratios.
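A minimal sketch of the pipeline's two ingredients, with stand-ins where the paper's details are not reproduced: moment features up to order eight (cumulants omitted), and an RBFN whose k-means center placement substitutes for the swarm-intelligence construction.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def moment_features(S):
    """Moments |E[s^k]|, k = 2..8, of each signal (one row per signal):
    a stand-in for the paper's moment/cumulant features."""
    return np.column_stack([np.abs((S ** k).mean(axis=1)) for k in range(2, 9)])

class RBFN(BaseEstimator, ClassifierMixin):
    """Minimal RBF network: k-means centers, Gaussian hidden layer,
    logistic readout. Center placement stands in for the paper's
    swarm-intelligence construction."""
    def __init__(self, n_centers=10, gamma=1.0):
        self.n_centers, self.gamma = n_centers, gamma
    def _hidden(self, X):
        d2 = ((X[:, None, :] - self.centers_[None]) ** 2).sum(-1)
        return np.exp(-self.gamma * d2)
    def fit(self, X, y):
        self.centers_ = KMeans(self.n_centers, n_init=10,
                               random_state=0).fit(X).cluster_centers_
        self.readout_ = LogisticRegression(max_iter=1000).fit(self._hidden(X), y)
        return self
    def predict(self, X):
        return self.readout_.predict(self._hidden(X))

# k-fold CV over the whole pipeline, e.g. for signals S with labels y:
# print(cross_val_score(RBFN(), moment_features(S), y, cv=10))
```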
Estimating software fault-proneness early, i.e., predicting the probability of software modules being faulty, can help in reducing costs and increasing the effectiveness of software analysis and testing. The many available static metrics provide important information, but none of them can be deterministically related to software fault-proneness. Fault-proneness models seem to be an interesting alternative, but work on them is still hampered by a lack of experimental validation.
This paper discusses barriers and problems in using software fault-proneness in industrial environments, proposes a method for building software fault-proneness models based on logistic regression and cross-validation that meets industrial needs, and provides some experimental evidence of the validity of the proposed approach.
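A minimal sketch of the modeling core, a logistic regression over static module metrics evaluated by cross-validation; the metric matrix and labels below are synthetic stand-ins, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

# Hypothetical inputs: one row of static code metrics per module
# (e.g., LOC, cyclomatic complexity, fan-in/fan-out); label = faulty or not.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                  # stand-in metric matrix
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(max_iter=1000)
pred = cross_val_predict(model, X, y, cv=10)   # out-of-fold fault predictions
print(classification_report(y, pred))
```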
Biological data produced by high-throughput technologies are becoming more and more abundant and are raising many statistical questions. This paper addresses one of them: the case in which gene expression data are jointly observed with other variables, with the purpose of highlighting significant relationships between gene expression and those other variables. One relevant statistical method for exploring these relationships is Canonical Correlation Analysis (CCA). Unfortunately, in the context of postgenomic data, the number of variables (gene expressions) is usually greater than the number of units (samples) and CCA cannot be directly performed: a regularized version is required.
We applied regularized CCA to data sets from two different studies and show that its interpretation evidences both previously validated relationships and new hypotheses. From the first data set (a nutrigenomic study), we generated interesting hypotheses on the transcription factor pathways potentially linking hepatic fatty acids and gene expression. From the second data set (a pharmacogenomic study on the NCI-60 cancer cell line panel), we identified new ABC transporter candidate substrates, whose relevance is illustrated by the concomitant identification of several known substrates.
In conclusion, the use of regularized CCA is likely to be relevant to a variety of biological experiments involving the generation of high-throughput data. We have demonstrated its ability to enhance the range of relevant conclusions that can be drawn from these relatively expensive experiments.
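A minimal sketch of ridge-regularized CCA of the kind applied here: shrinking the covariance blocks keeps the problem well-posed when variables outnumber samples. The regularization parameters lam_x and lam_y (default values illustrative) would themselves be tuned, e.g., by cross-validation.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power, svd

def regularized_cca(X, Y, lam_x=0.1, lam_y=0.1, n_comp=2):
    """Ridge-regularized CCA via whitening and SVD."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + lam_x * np.eye(X.shape[1])   # shrunken covariances
    Cyy = Y.T @ Y / n + lam_y * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Rx = fractional_matrix_power(Cxx, -0.5)          # whitening transforms
    Ry = fractional_matrix_power(Cyy, -0.5)
    U, s, Vt = svd(Rx @ Cxy @ Ry)
    A = Rx @ U[:, :n_comp]        # canonical weights for X variables
    B = Ry @ Vt[:n_comp].T        # canonical weights for Y variables
    return A, B, s[:n_comp]       # s holds the canonical correlations
```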
Regression analysis estimates the relationships among variables and has been widely used for growth curves, while cross-validation, as a model selection method, assesses the generalization ability of regression models. Classical methods assume that the observed values of variables are precise numbers, whereas in many cases data are imprecisely collected. This paper therefore explores the Chapman-Richards growth model, one of the most widely used growth models, with imprecise observations under the framework of uncertainty theory. The least squares estimates of the unknown parameters in this model are given. Moreover, cross-validation with imprecise observations is proposed. Furthermore, estimates of the expected value and variance of the uncertain error using residuals are given. In addition, ways to predict the value of the response variable from newly observed values of the predictor variables are discussed. Finally, a numerical example illustrates our approach.
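For orientation, a classical (precise-observation) sketch of fitting the Chapman-Richards curve with K-fold cross-validation; the paper's uncertain-variable treatment of imprecise observations is not covered by this sketch, and the data below are synthetic.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.model_selection import KFold

def chapman_richards(t, a, b, c):
    """Chapman-Richards growth curve: y = a * (1 - exp(-b t))**c."""
    return a * (1.0 - np.exp(-b * t)) ** c

# Hypothetical precise observations of size over time.
t = np.linspace(1, 30, 40)
y = chapman_richards(t, 50, 0.15, 2.0) + np.random.default_rng(2).normal(0, 1, 40)

errs = []
for tr, te in KFold(5, shuffle=True, random_state=0).split(t):
    p, _ = curve_fit(chapman_richards, t[tr], y[tr], p0=(40, 0.1, 1.5))
    errs.append(np.mean((chapman_richards(t[te], *p) - y[te]) ** 2))
print("5-fold CV mean squared error:", np.mean(errs))
```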
Prediction of the biological functions of genes is an important issue in basic biology research and has applications in drug discovery and gene therapy. Previous studies have shown that either gene expression data or protein-protein interaction data alone can be used for predicting gene functions. In particular, clustering gene expression profiles has been widely used for gene function prediction. In this paper, we first propose a new method for gene function prediction using protein-protein interaction data, which facilitates combining prediction results based on clustering gene expression profiles. We then propose a new method to combine the prediction results based on either source of data by weighting the evidence provided by each. Using protein-protein interaction data downloaded from the GRID database and published gene expression profiles from 300 microarray experiments for the yeast S. cerevisiae, we show that this new combined analysis provides improved predictive performance over that of using either data source alone in a cross-validated analysis of the MIPS gene annotations. Finally, we propose a logistic regression method that is flexible enough to combine information from any number of data sources while maintaining computational feasibility.
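A minimal sketch of the combination idea, assuming each data source yields a per-gene evidence score (the scores below are synthetic stand-ins); logistic regression learns a weight for each source, so more reliable evidence contributes more.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-gene evidence scores from two sources:
# s_expr -- score from clustering gene expression profiles
# s_ppi  -- score from protein-protein interaction neighbors
rng = np.random.default_rng(3)
y = rng.integers(0, 2, 500)                  # gene has the function or not
s_expr = y + rng.normal(0, 1.0, 500)
s_ppi = y + rng.normal(0, 1.5, 500)          # noisier evidence source

Z = np.column_stack([s_expr, s_ppi])
combiner = LogisticRegression().fit(Z, y)    # learns per-source weights
print("learned source weights:", combiner.coef_)
```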
The study of interactions between host and pathogen proteins is important for understanding the underlying mechanisms of infectious diseases and for developing novel therapeutic solutions. Wet-lab techniques for detecting protein–protein interactions (PPIs) can benefit from computational predictions. Machine learning is one of the computational approaches that can assist biologists by predicting promising PPIs. A number of machine learning based methods for predicting host–pathogen interactions (HPI) have been proposed in the literature. The techniques used for assessing the accuracy of such predictors are of critical importance in this domain. In this paper, we question the effectiveness of K-fold cross-validation for estimating the generalization ability of HPI prediction for proteins with no known interactions. K-fold cross-validation does not model this scenario, and we demonstrate a sizable difference between its performance and the performance of an alternative evaluation scheme called leave one pathogen protein out (LOPO) cross-validation. LOPO is more effective in modeling the real world use of HPI predictors, specifically for cases in which no information about the interacting partners of a pathogen protein is available during training. We also point out that currently used metrics such as areas under the precision-recall or receiver operating characteristic curves are not intuitive to biologists and propose simpler and more directly interpretable metrics for this purpose.
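LOPO maps directly onto grouped cross-validation with the pathogen protein as the group; a minimal sketch with synthetic stand-in data and an arbitrary classifier follows.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Hypothetical HPI data: one row per host-pathogen protein pair;
# `groups` holds the identity of the pathogen protein in each pair.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10))
y = rng.integers(0, 2, 300)
groups = rng.integers(0, 30, 300)        # 30 distinct pathogen proteins

# LOPO: each fold holds out ALL pairs of one pathogen protein, so the
# model is always tested on a protein it never saw during training.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print("LOPO mean accuracy:", scores.mean())
```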
Some biomedical datasets contain a small number of samples with large numbers of features. This can make analysis challenging and prone to errors such as overfitting and misinterpretation. To improve the accuracy and reliability of analysis in such cases, we present a tutorial that demonstrates a mathematical approach for a supervised two-group classification problem using two medical datasets. The tutorial provides insights on effectively addressing uncertainties and handling missing values without the need for removing samples or imputing additional data. We describe a method that considers the size and shape of feature distributions, as well as the pairwise relations between measured features, as separately derived features and prognostic factors. Additionally, we explain how to perform similarity calculations that account for the variation of feature values within groups and for inaccuracies in individual value measurements. By following these steps, a more accurate and reliable analysis can be achieved when working with biomedical datasets that have a small sample size and many features.
This paper proposes a quantum-classical algorithm to evaluate and select classical artificial neural network architectures. The proposed algorithm is based on a probabilistic quantum memory (PQM) and the possibility of training artificial neural networks (ANN) in superposition. We obtain an exponential quantum speedup in the evaluation of neural networks. We also verify, through a reduced experimental analysis, that the proposed algorithm can be used to select near-optimal neural networks.
The method presented in this paper is novel in being a natural combination of two mutually dependent steps. Feature selection is the key element (first step) of our classification system, which was employed during the 2010 International RSCTC data mining (bioinformatics) Challenge. The second step may be implemented using any suitable classifier, such as linear regression, support vector machines, or neural networks. We conducted leave-one-out (LOO) experiments with several feature selection techniques and classifiers. Based on the LOO evaluations, we decided to use feature selection with the separation-type Wilcoxon-based criterion for all final submissions. The method presented in this paper was tested successfully during the RSCTC data mining Challenge, where we achieved the top score in the Basic track.
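A minimal sketch of the two-step idea, with the plain Wilcoxon rank-sum statistic standing in for the paper's separation-type criterion, and feature selection redone inside each LOO fold so the left-out sample never influences the selection.

```python
import numpy as np
from scipy.stats import ranksums
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import LinearSVC

def wilcoxon_top_features(X, y, k=20):
    """Rank features by the absolute Wilcoxon rank-sum statistic between
    the two groups and keep the k most separating ones."""
    stats_ = [abs(ranksums(X[y == 0, j], X[y == 1, j]).statistic)
              for j in range(X.shape[1])]
    return np.argsort(stats_)[::-1][:k]

def loo_accuracy(X, y, k=20):
    """LOO evaluation: select features, train the classifier (second
    step), and test on the single held-out sample, for every sample."""
    hits = []
    for tr, te in LeaveOneOut().split(X):
        feats = wilcoxon_top_features(X[tr], y[tr], k)
        clf = LinearSVC().fit(X[tr][:, feats], y[tr])
        hits.append(clf.predict(X[te][:, feats])[0] == y[te][0])
    return float(np.mean(hits))
```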
In this paper, we propose an effective convolutional neural network (CNN) model for the problem of face recognition. The proposed CNN architecture applies fused convolution/subsampling layers, resulting in a simpler model with fewer network parameters; that is, a smaller number of neurons, trainable parameters, and connections. In addition, it does not require any complex or costly image preprocessing steps that are typical in existing face recognition systems. In this work, we enhance the stochastic diagonal Levenberg–Marquardt algorithm, a second-order back-propagation algorithm, to obtain faster network convergence and better generalization ability. Experimental work on the ORL database shows that a recognition accuracy of 100% is achieved, with the network converging within 15 epochs. The average processing time of the proposed CNN face recognition solution, executed on a 2.5 GHz Intel i5 quad-core processor, is 3 s per epoch, with a recognition time of less than 0.003 s. These results show that the proposed CNN model is a computationally efficient architecture that exhibits faster processing and learning times and higher recognition accuracy, outperforming existing neural-network-based face recognizers.
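A minimal sketch of the fused idea in PyTorch, where a strided convolution plays the role of a convolution layer and a subsampling layer at once; the layer sizes and input resolution are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FusedCNN(nn.Module):
    """Each strided convolution convolves and subsamples in one layer,
    shrinking the neuron, parameter, and connection counts."""
    def __init__(self, n_classes=40):                    # ORL has 40 subjects
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=2), nn.Tanh(),   # conv + subsample
            nn.Conv2d(6, 16, kernel_size=5, stride=2), nn.Tanh(),  # conv + subsample
        )
        self.classifier = nn.LazyLinear(n_classes)       # infers flattened size
    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = FusedCNN()
out = model(torch.randn(2, 1, 56, 46))   # illustrative grayscale face crops
print(out.shape)                          # torch.Size([2, 40])
```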
Traditional time series analysis deals with observations in chronological order, assuming the observations are precise numbers under the framework of probability theory, whereas in many cases data are imprecisely collected. This paper characterizes imprecisely observed data as uncertain variables and estimates the unknown parameters in the uncertain autoregressive model using the Huber loss function, which is more flexible than other robust estimators for a pre-given k that regulates the amount of robustness. Prediction values and prediction intervals for future values are then given. Moreover, a method to choose k by cross-validation is proposed. Finally, numerical examples show our methods in detail and illustrate the robustness of Huber estimation by comparing it with least squares estimation. Our methods are also applied to a set of real data on carbon dioxide concentrations.
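A minimal classical sketch (precise observations, not the paper's uncertain-variable setting) of Huber-loss estimation for an AR(1) model, with k chosen by a simple blocked cross-validation; the candidate k values and fold scheme are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, k):
    """Huber loss: quadratic for |r| <= k, linear beyond."""
    return np.where(np.abs(r) <= k, 0.5 * r ** 2, k * (np.abs(r) - 0.5 * k))

def fit_ar1_huber(x, k):
    """AR(1) x_t = a + b x_{t-1} + e_t, estimated by minimizing Huber loss."""
    def loss(theta):
        a, b = theta
        return huber(x[1:] - a - b * x[:-1], k).sum()
    return minimize(loss, x0=[0.0, 0.5]).x

def choose_k_cv(x, candidates=(0.5, 1.0, 1.345, 2.0), n_folds=5):
    """Pick k by cross-validation on one-step-ahead prediction error
    (a simple blocked scheme; the paper's exact CV may differ)."""
    t = np.arange(1, len(x))
    errs = []
    for k in candidates:
        fold_err = []
        for f in range(n_folds):
            te = t[t % n_folds == f]             # held-out time points
            tr = np.setdiff1d(t, te)
            def loss(theta, idx=tr):
                a, b = theta
                return huber(x[idx] - a - b * x[idx - 1], k).sum()
            a, b = minimize(loss, x0=[0.0, 0.5]).x
            fold_err.append(np.mean((x[te] - a - b * x[te - 1]) ** 2))
        errs.append(np.mean(fold_err))
    return candidates[int(np.argmin(errs))]
```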