In many real-world applications, the class distribution of instances is imbalanced and the costs of misclassification differ across classes, so class-imbalanced cost-sensitive learning has attracted much attention from researchers. Sampling is one of the most widely used techniques for dealing with the class-imbalance problem: it alters the class distribution of instances so that the minority class is well represented in the training data. In this paper, we propose a novel Minority Cloning Technique (MCT) for class-imbalanced cost-sensitive learning. MCT alters the class distribution of the training data by cloning each minority-class instance according to the similarity between that instance and the mode of the minority class. Experimental results on a large number of UCI datasets show that MCT performs much better than the Minority Oversampling with Replacement Technique (MORT) and the Synthetic Minority Oversampling TEchnique (SMOTE) in terms of the total misclassification cost of the built classifiers.
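As a rough illustration of the cloning idea (not the authors' exact procedure), a minimal Python sketch follows; the per-feature mode, the match-based similarity, and the `max_clones` cap are assumptions for concreteness, and non-negative integer-coded (i.e. discretized) features are assumed:

```python
import numpy as np

def minority_cloning(X, y, minority_label, max_clones=5):
    # Instances of the minority class.
    X_min = X[y == minority_label]
    # Per-feature mode of the minority class (assumes non-negative
    # integer-coded, i.e. discretized, features).
    mode = np.array([np.bincount(col).argmax()
                     for col in X_min.T.astype(int)])
    # Similarity of each minority instance to the mode: the fraction
    # of features on which it agrees with the mode.
    sim = (X_min == mode).mean(axis=1)
    # Clone each minority instance in proportion to its similarity.
    counts = np.rint(sim * max_clones).astype(int)
    clones = np.repeat(X_min, counts, axis=0)
    X_out = np.vstack([X, clones])
    y_out = np.concatenate([y, np.full(len(clones), minority_label)])
    return X_out, y_out
```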
Decision-tree algorithms are known to be unstable: small variations in the training set can produce different trees and different predictions for the same validation examples. Both accuracy and stability can be improved by learning multiple models from bootstrap samples of the training data, but this "meta-learner" approach makes the extracted knowledge hard to interpret. In this paper, we present the Info-Fuzzy Network (IFN), a novel information-theoretic method for building stable and comprehensible decision-tree models. The stability of the IFN algorithm is ensured by restricting the tree structure to use the same feature for all nodes of the same tree level and by built-in statistical significance tests. The IFN method is shown empirically to produce more compact and stable models than the "meta-learner" techniques, while preserving a reasonable level of predictive accuracy.
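The same-feature-per-level idea with a significance gate can be sketched as below; this is a simplified stand-in for IFN's actual construction, using mutual information and the standard G-test (the statistic 2·N·MI is asymptotically chi-squared), with `used` holding the features already assigned to earlier levels:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.metrics import mutual_info_score

def select_level_feature(X, y, used, alpha=0.01):
    """Pick one feature for an entire tree level (oblivious-tree style),
    keeping it only if the information gain is statistically significant."""
    best, best_mi = None, 0.0
    for j in range(X.shape[1]):
        if j in used:
            continue
        mi = mutual_info_score(X[:, j], y)  # empirical MI, in nats
        if mi > best_mi:
            best, best_mi = j, mi
    if best is None:
        return None
    # Likelihood-ratio (G) test: G = 2*N*MI ~ chi^2 with (r-1)(c-1) df.
    n = len(y)
    df = (len(np.unique(X[:, best])) - 1) * (len(np.unique(y)) - 1)
    g = 2.0 * n * best_mi
    return best if chi2.sf(g, df) < alpha else None
```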
This paper sheds light on the changes in cryptocurrencies caused by the COVID-19 shock from a nonlinear cross-correlation and similarity perspective. We collected daily price and volume data for the seven largest cryptocurrencies by trade volume and market capitalization. For both attributes (price and volume), we calculate their volatility and apply Multifractal Detrended Cross-Correlation Analysis (MF-DCCA) to estimate the complexity parameters that describe the degree of multifractality of the underlying process. Both before and during COVID-19, we detect standard multifractal behavior for these volatility time-series pairs and an overall persistent long-term correlation. However, multifractality for the price-volatility pairs displays more persistent behavior than for the volume-volatility pairs. From a financial perspective, the price-volatility pairs are marked by an increase in nonlinear cross-correlations, except for the pair Bitcoin versus Dogecoin (α_xy(0) = −1.14%), while all volume-volatility pairs are marked by a decrease in nonlinear cross-correlations. The K-means technique indicates that the price-volatility time series were resilient to the COVID-19 shock, whereas for the volume-volatility time series the shock drove changes in cryptocurrency groups.
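For reference, a compact sketch of the standard MF-DCCA fluctuation computation is given below (segment handling from the end of the series and other refinements are omitted); the slope of log F_q(s) versus log s estimates the generalized cross-correlation exponent h_xy(q):

```python
import numpy as np

def mfdcca_fq(x, y, scales, q_list, order=1):
    # Profiles: cumulative sums of the mean-centered series.
    X = np.cumsum(x - np.mean(x))
    Y = np.cumsum(y - np.mean(y))
    F = np.zeros((len(q_list), len(scales)))
    for si, s in enumerate(scales):
        segs = len(X) // s
        t = np.arange(s)
        f2 = np.empty(segs)
        for v in range(segs):
            sl = slice(v * s, (v + 1) * s)
            # Detrend each segment of both profiles with a local polynomial.
            rx = X[sl] - np.polyval(np.polyfit(t, X[sl], order), t)
            ry = Y[sl] - np.polyval(np.polyfit(t, Y[sl], order), t)
            f2[v] = np.abs(np.mean(rx * ry))  # detrended cross-covariance
        for qi, q in enumerate(q_list):
            F[qi, si] = (np.exp(0.5 * np.mean(np.log(f2 + 1e-20)))
                         if q == 0
                         else np.mean(f2 ** (q / 2.0)) ** (1.0 / q))
    return F
```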
We address the challenge of classifying financial time series via a newly proposed multiscale symbolic phase transfer entropy (MSPTE). Using the MSPTE method, we quantify the strength and direction of information flow between financial systems and simultaneously classify financial time series: stock indices from Europe, America, and China over the period 2006–2016, and stocks from the banking, aviation, and pharmaceutical industries over the period 2007–2016. The MSPTE analysis shows that the value of the symbolic phase transfer entropy (SPTE) among stocks decreases as the scale factor increases. The MSPTE method is shown to divide stocks well into groups by area and industry, and the analysis quantifies the similarity among stock markets: the SPTE between two stocks from the same area is far smaller than the SPTE between stocks from different areas. The results also indicate that four stocks from America and Europe have a relatively high degree of similarity, and that the banking and pharmaceutical stocks have higher similarity for CA. It is worth mentioning that the pharmaceutical industry has a weaker industry-specific market mechanism than the banking and aviation industries.
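The building blocks of such an analysis can be sketched as follows; this is a simplified reading of the pipeline (coarse-grain, extract Hilbert phases, symbolize, compute transfer entropy), with phase quantization into `n_bins` symbols and order-1 histories as assumptions. SPTE from y to x at scale s would then be `transfer_entropy(symbolize_phase(coarse_grain(x, s)), symbolize_phase(coarse_grain(y, s)))`:

```python
import numpy as np
from collections import Counter
from scipy.signal import hilbert

def coarse_grain(x, scale):
    # Multiscale step: average non-overlapping windows of length `scale`.
    n = len(x) // scale
    return x[:n * scale].reshape(n, scale).mean(axis=1)

def symbolize_phase(x, n_bins=4):
    # Instantaneous phase via the Hilbert transform, quantized to symbols.
    phase = np.angle(hilbert(x))
    edges = np.linspace(-np.pi, np.pi, n_bins + 1)[1:-1]
    return np.digitize(phase, edges)

def transfer_entropy(sx, sy):
    # TE(Y -> X) on symbol sequences, order-1 histories:
    # sum p(x1,x0,y0) * log2[ p(x1|x0,y0) / p(x1|x0) ]
    n = len(sx) - 1
    c_xxy = Counter(zip(sx[1:], sx[:-1], sy[:-1]))
    c_xy = Counter(zip(sx[:-1], sy[:-1]))
    c_xx = Counter(zip(sx[1:], sx[:-1]))
    c_x = Counter(sx[:-1])
    te = 0.0
    for (x1, x0, y0), c in c_xxy.items():
        p = c / n
        te += p * np.log2((c / c_xy[(x0, y0)]) / (c_xx[(x1, x0)] / c_x[x0]))
    return te
```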
Most software development does not start from scratch but reuses previously developed artifacts. These reusable artifacts are involved in various phases of the software life cycle, from requirements to maintenance. Software design, as a high-level phase of the development process, has an important impact on the subsequent stages, so its reuse is receiving more and more attention. The Unified Modeling Language (UML) class diagram has become a de facto standard modeling tool for software design, so its reuse is accordingly a concern. So far, research on the reuse of UML class diagrams has focused on matching and retrieval. As large numbers of class diagrams enter repositories for reuse, classification becomes an essential task. Classification can be unsupervised (also known as clustering) or supervised. In our previous work, we discussed the clustering of UML class diagrams; in this paper, we focus on their supervised classification and propose a supervised classification method. A novel ensemble classifier, F-KNB, combining both dependent and independent construction ideas, is built. The similarity of class diagrams is described, with semantic, structural, and hybrid matching each defined. The extracted feature elements are used in the base classifiers F-KNN and F-NB, which are built on improved K-nearest neighbors (KNN) and Naive Bayes (NB), respectively. A series of experiments shows that the proposed ensemble classifier F-KNB achieves good classification quality and efficiency under varying sizes and distributions of training samples.
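A hypothetical stand-in for the F-KNB combination (not the paper's actual implementation) is a soft-voting ensemble of a KNN base classifier and Naive Bayes over precomputed similarity features, which scikit-learn expresses directly:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Soft voting averages the class probabilities of both base classifiers;
# X_train / X_test are assumed to hold the extracted feature elements.
f_knb_like = VotingClassifier(
    estimators=[("f-knn", KNeighborsClassifier(n_neighbors=5)),
                ("f-nb", GaussianNB())],
    voting="soft",
)
# f_knb_like.fit(X_train, y_train); f_knb_like.predict(X_test)
```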
Concrete gravity dams are a common type of dam. Dynamic response analysis of concrete gravity dams under underwater explosions is indispensable for survivability assessments. Owing to environmental and other constraints, an underwater explosion test on a prototype concrete gravity dam is difficult to conduct. The centrifuge scaled-down test, which satisfies similarity theory, provides a new way to study the dynamic response of concrete gravity dams to underwater explosions. In a near-field explosion, the shock wave of the underwater explosion directly damages the gravity dam, and the detonation products that follow the shock wave aggravate the damage. After verifying the concrete dynamic material parameters and the fluid–solid coupling model of the underwater explosion, we carried out numerical calculations of the scaled centrifuge underwater explosion model and of the prototype underwater explosion, and discussed the similarity between the scaled centrifuge dam-response model and the prototype. The analysis shows that the scaled centrifuge model reflects well the local damage of the prototype dam subjected to shock waves; the difference between the scaled centrifuge model and the prototype lies mainly in the overall damage. The damage from detonation products on the prototype dam body is concentrated mainly in its middle and lower parts, whereas in the scaled-down model it is concentrated mainly in the middle and upper parts of the dam body.
Mathematical equations are now found not only in books; they also help solve biological problems by explaining the technicalities of current biological models and by providing predictions that can be validated against, and complement, experimental and clinical studies. In this paper, we use multiset (mset) theory to study DNA and RNA mutations and to detect where mutations occur. We also use the link between the mset concept and topology to determine the compatibility or similarity between "types", which may be strings of bits, vectors, DNA or RNA sequences, etc.
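One concrete way multisets can compare sequences, offered purely as an illustration (the k-mer framing is an assumption, not the paper's construction), is to treat each sequence as a multiset of substrings, so that repeated motifs keep their counts:

```python
from collections import Counter

def mset_similarity(seq_a, seq_b, k=3):
    """Compare two DNA/RNA sequences as multisets (msets) of k-mers."""
    ma = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    mb = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
    inter = sum((ma & mb).values())   # multiset intersection size
    union = sum((ma | mb).values())   # multiset union size
    return inter / union if union else 1.0

# A point mutation changes some k-mers and lowers the similarity:
print(mset_similarity("ACGTACGT", "ACGTACGT"))  # 1.0
print(mset_similarity("ACGTACGT", "ACGTTCGT"))  # < 1.0
```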
In this paper, we present an analysis of oracle-bone characters for animals from a "cognitive" point of view. After some general remarks on oracle-bone characters in Sec. 1 and a short outline of the paper in Sec. 2, we collect various oracle-bone characters for animals from published resources in Sec. 3. In Sec. 4, we begin analyzing a group of 60 ancient animal characters from www.zdic.net, a highly acclaimed internet dictionary of Chinese characters that is strictly based on historical sources, and introduce five categories of specific features regarding their (graphical) structure that are used in Sec. 5 to associate corresponding feature vectors to these characters. In Sec. 6, these feature vectors are used to investigate their dissimilarity in terms of a family of parameterized distance measures. In the last section, we apply the SplitsTree method, as encoded in the NeighborNet algorithms, to construct a corresponding family of dissimilarity-based networks, with the intention of elucidating how the ancient Chinese might have perceived the "animal world" in the late Bronze Age, and of demonstrating that these pictographs reflect an intuitive understanding of this world and its inherent structure that predates its classification in the oldest surviving Chinese encyclopedia, the Er Ya, from approximately the third century BC, as well as similar classification systems in the West, by one to two millennia. We also present an English dictionary of 70 oracle-bone characters for animals in Appendix A. In Appendix B, we list various variants of animal characters published in the Jia Gu Wen Bian (cf. 甲骨文编, A Complete Collection of Oracle Bone Characters, edited by the Institute of Archaeology of the Chinese Academy of Social Sciences, published by the Zhonghua Book Company in 1965). In Appendix C, we recall the frequencies of the 521 most frequent oracle-bone characters as reported in [T. Chen, Yin-Shang Jiaguwen Zixing Xitong Zai Yanjiu (The Structural System of Oracle Inscriptions), Shanghai Renmin Chubanshe, Shanghai, 2010; Jiaguwen Shiwen Yongzi Pinlü Biao (A Frequency List of Oracle Characters), Center for the Study and Application of Chinese Characters, East China Normal University, Shanghai, 2010, http://www.wenzi.cn/en/default.aspx]. And in Appendix D, we list the animals registered in the last five chapters of the Er Ya.
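One plausible family of parameterized distance measures on such feature vectors, sketched here only as an assumption since the paper's exact family is not given in the abstract, is a weighted Minkowski-style distance whose parameter p and weights w trace out the family; its pairwise values form the dissimilarity matrix fed to NeighborNet:

```python
import numpy as np

def parameterized_distance(u, v, p=2.0, w=None):
    # Weighted Minkowski-style distance between two feature vectors;
    # varying p (and the feature weights w) yields a family of measures.
    u, v = np.asarray(u, float), np.asarray(v, float)
    w = np.ones_like(u) if w is None else np.asarray(w, float)
    return (w @ np.abs(u - v) ** p) ** (1.0 / p)
```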
In the world of the Internet of Things (IoT), heterogeneous systems and devices need to be connected and to exchange data with one another, and how such data exchange can be realized automatically becomes a critical issue. An information model (IM) is frequently adopted to solve the data interoperability problem. However, since IoT systems and devices can have different IMs built with different modeling methodologies and formats, such as UML and IEC 61360, automated data interoperability across various IMs is recognized as an urgent problem. In this paper, we propose an approach to automate data interoperability, i.e. data exchange among similar entities in different IMs. First, similarity scores among entities are calculated based on their syntactic and semantic features. Then, to obtain precise candidates for data exchange, a concept of class distance, calculated with a Virtual Distance Graph (VDG), is proposed to narrow down the obtained similar properties. Analysis of a case study shows that the class distance based on the VDG effectively improves the precision of the calculated similar properties, and that data exchange rules can be generated automatically. The results reveal that the proposed approach contributes efficiently to resolving the data interoperability problem.
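The two ingredients can be sketched as below; the edit-ratio similarity for the syntactic feature and the shortest-path reading of the VDG are hypothetical stand-ins, since the abstract does not detail either construction:

```python
from difflib import SequenceMatcher
import networkx as nx

def syntactic_similarity(name_a, name_b):
    # Edit-based similarity of entity names (the syntactic feature).
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def class_distance(vdg, class_a, class_b):
    # Hypothetical VDG stand-in: classes as graph nodes, class distance
    # as shortest-path length; a large distance filters out candidate
    # property pairs whose owning classes are unrelated.
    try:
        return nx.shortest_path_length(vdg, class_a, class_b)
    except nx.NetworkXNoPath:
        return float("inf")
```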
In geometric group theory, one of the milestones is Gromov's polynomial growth theorem: a finitely generated group has polynomial growth if and only if it is virtually nilpotent. Inspired by Gromov's work, we introduce growth types of weighted Hardy spaces. In this paper, we focus on weighted Hardy spaces of polynomial growth, which cover the classical Hardy space, weighted Bergman spaces, weighted Dirichlet spaces, and much more. Our main results are as follows. (1) We obtain the boundedness of composition operators whose symbols are analytic automorphisms of the open unit disk acting on weighted Hardy spaces of polynomial growth, which implies that the multiplication operator M_z is similar to M_φ for any analytic automorphism φ of the open unit disk. Moreover, we obtain the boundedness of composition operators induced by analytic functions on the closed unit disk on weighted Hardy spaces of polynomial growth. (2) For any Blaschke product B of order m, M_B is similar to ⊕_1^m M_z, which answers affirmatively a generalized version of a question proposed by Douglas in 2007. (3) We also give counterexamples showing that a composition operator whose symbol is an analytic automorphism of the open unit disk can be unbounded on a weighted Hardy space of intermediate growth, which indicates the necessity of the polynomial growth condition; the collection of weighted Hardy spaces of polynomial growth is thus almost the largest class on which Douglas's question has an affirmative answer. (4) Finally, we give the Jordan representation theorem and a similarity classification for analytic functions on the closed unit disk acting as multiplication operators on a weighted Hardy space of polynomial growth.
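For orientation, a weighted Hardy space on the unit disk is commonly defined by a weight sequence, as recorded below; this is the standard definition, and the paper's exact polynomial-growth condition on the weights may differ in detail:

```latex
H_w^2 \;=\; \Big\{\, f(z) = \sum_{n \ge 0} a_n z^n \;:\;
   \|f\|_w^2 = \sum_{n \ge 0} |a_n|^2 \, w_n < \infty \,\Big\}, \qquad w_n > 0.
% Classical cases: w_n \equiv 1 (Hardy), w_n = (n+1)^{-1} (Bergman),
% w_n = n + 1 (Dirichlet). "Polynomial growth" constrains w_n to lie
% between two polynomials in n.
```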
To address the limitations that current node-influence ranking algorithms can be applied only to a single type of network and yield inaccurate results, an algorithm based on similarity is proposed. When a node is similar to many nodes in the network, it is representative and can be treated as an influential node. First, a probabilistic walking model is used to simulate initiative visits between nodes in different types of networks. Second, a superposed probabilistic transfer similarity is defined on this model, taking nodes' inbound and outbound information into account. Finally, a node-ranking algorithm is built on the new similarity measure. Experiments show that the algorithm evaluates different kinds of networks with high accuracy, whether the network is directed or undirected, weighted or unweighted.
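A minimal sketch of the overall idea follows; the specific similarity (overlap of one-step walk distributions, superposed over the out-bound and in-bound views) is an assumption for illustration, since the abstract does not give the exact definition:

```python
import numpy as np

def rank_by_transfer_similarity(A):
    """Rank nodes of a (possibly weighted, possibly directed) network
    given its adjacency matrix A: a node similar to many others is
    deemed representative and hence influential."""
    A = np.asarray(A, dtype=float)
    # Row-normalize for one-step walk probabilities (out-bound view).
    out_sum = A.sum(axis=1, keepdims=True)
    P_out = np.divide(A, out_sum, out=np.zeros_like(A), where=out_sum > 0)
    # Column-normalize for the in-bound view.
    in_sum = A.sum(axis=0, keepdims=True)
    P_in = np.divide(A, in_sum, out=np.zeros_like(A), where=in_sum > 0).T
    # Superposed similarity: overlap of walk distributions in both views.
    S = 0.5 * (P_out @ P_out.T + P_in @ P_in.T)
    return np.argsort(-S.sum(axis=1))  # most representative first
```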
Detecting the natural communities in a real-world network can uncover its underlying structure and potential function. In this paper, a novel community detection algorithm, SUM, is introduced. The fundamental idea of SUM is that a node with relatively low degree stays faithful to its community, because it has links only with nodes in one community, whereas a node with relatively high degree has links with nodes both inside and outside its community, which may cause confusion when detecting communities. Based on this idea, SUM detects communities by suspecting the links of the maximum-degree nodes to their neighbors within a community, while relying mainly on the nodes with relatively low degree. SUM defines a similarity that takes into account both the commonality and the rejective degree of two adjacent nodes. After putting similar nodes into one community, SUM generates initial communities by reassigning the maximum-degree nodes. Next, SUM assigns unlabeled nodes to the initial communities and adjusts each border node to the community it is most linked to. To evaluate its effectiveness, SUM is compared with seven baselines, including four classical and three state-of-the-art methods, on a wide range of complex networks. On small networks with ground-truth community structures, results are demonstrated visually and measured quantitatively with ARI, NMI, and Modularity; on relatively large networks without ground-truth structures, the algorithms are evaluated by Modularity. Experimental results indicate that SUM effectively determines high-quality community structures on small and relatively large networks, and outperforms the compared state-of-the-art methods.
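A toy similarity in the spirit of SUM might look as follows; the exact form of the "rejective degree" is not given in the abstract, so the degree penalty here is a loudly hypothetical stand-in that merely mirrors the idea of distrusting hub links:

```python
import networkx as nx

def sum_like_similarity(G, u, v):
    # Reward common neighborhood (commonality) between adjacent nodes,
    # penalize high degree (stand-in for the "rejective degree"), since
    # links of hub nodes are treated as less trustworthy.
    common = len(set(G[u]) & set(G[v]))
    penalty = max(G.degree(u), G.degree(v))
    return (common + (1 if G.has_edge(u, v) else 0)) / (penalty + 1)
```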
Improving naive Bayes (NB for short) [15,28] for classification has received significant attention. Related work can be broadly divided into two approaches: eager learning and lazy learning [1]. Different from eager learning, the key idea of extending naive Bayes using lazy learning is to learn an improved naive Bayes for each test instance. In recent years, several lazy extensions of naive Bayes have been proposed, for example LBR [30], SNNB [27], and LWNB [8]. All these algorithms aim to improve naive Bayes' classification performance, and indeed they achieve significant improvement in classification, measured by accuracy. In many real-world data mining applications, however, an accurate ranking is more desirable than an accurate classification. A natural question is thus whether they also achieve significant improvement in ranking, measured by AUC (the area under the ROC curve) [2,11,17]. Responding to this question, we conduct experiments on the 36 UCI data sets [18] selected by Weka [12] to investigate their ranking performance, and find that they do not significantly improve the ranking performance of naive Bayes. Aiming at scaling up naive Bayes' ranking performance, we present a novel lazy method, ICNB (instance-cloned naive Bayes), and develop three ICNB algorithms using different instance cloning strategies. We empirically compare them with naive Bayes; the experimental results show that our algorithms achieve significant improvement in terms of AUC. Our research provides a simple but effective method for applications where an accurate ranking is desirable.
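The lazy, per-test-instance flavor of instance cloning can be sketched as below; the inverse-distance similarity and the `max_clones` scaling are assumptions standing in for the paper's three cloning strategies:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def icnb_predict_proba(X_train, y_train, x_test, max_clones=5):
    """Lazy sketch: clone training instances similar to the test instance
    before fitting NB, so the local neighborhood dominates the estimates."""
    # Similarity as inverse distance, scaled to [0, 1].
    d = np.linalg.norm(X_train - x_test, axis=1)
    sim = 1.0 - d / (d.max() + 1e-12)
    counts = 1 + np.rint(sim * max_clones).astype(int)  # keep every instance
    X_c = np.repeat(X_train, counts, axis=0)
    y_c = np.repeat(y_train, counts)
    return GaussianNB().fit(X_c, y_c).predict_proba(x_test[None, :])[0]
```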
A front-end method based on the random forest proximity distance (PD) is used to screen the test set to improve protein–protein interaction site (PPIS) prediction. The assessment of a distance metric is done under the assumption that a higher-quality distance definition leads to higher classification accuracy. On an independent test set, numerical analysis based on statistical inference shows that the PD has an advantage over the Mahalanobis and cosine distances. Since the proximity distance depends on the tree composition of the random forest model, an iterative method is designed to optimize it, adjusting the tree composition of the random forest model by adjusting the size of the training set. Two PD metrics, 75PD and 50PD, are obtained by the iterative method. On two independent test sets, compared with the PD produced by the original training set, 75PD achieved higher values of the Matthews correlation coefficient and F1 score, and the differences were statistically significant. All numerical experiments show that the closer the test data are to the training data, the better the predictor's results. These findings indicate that the iterative method can optimize the proximity distance definition and that the distance information provided by the PD can be used to indicate the reliability of prediction results.
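The proximity distance itself follows Breiman's classic definition: the proximity of two samples is the fraction of trees in which they land in the same leaf, and PD = 1 − proximity. A minimal sketch with scikit-learn (the screening rule at the end is an assumption illustrating the use, not the paper's exact criterion):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def proximity_distance(forest, X_train, X_test):
    """PD[i, j] = 1 - (share of trees where test i and train j share a leaf)."""
    leaves_tr = forest.apply(X_train)   # (n_train, n_trees) leaf indices
    leaves_te = forest.apply(X_test)    # (n_test,  n_trees)
    prox = (leaves_te[:, None, :] == leaves_tr[None, :, :]).mean(axis=2)
    return 1.0 - prox

# Screening sketch: flag test points whose nearest training PD is large,
# since predictions for them are expected to be less reliable.
# pd = proximity_distance(rf, X_train, X_test); unreliable = pd.min(1) > tau
```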
Let E^2 be the 2-dimensional Euclidean space, LSim(2) the group of all linear similarities of E^2, and LSim+(2) the group of all orientation-preserving linear similarities of E^2. The present paper is devoted to solving the problems of global G-equivalence of paths and curves in E^2 for the groups G = LSim(2), LSim+(2). Complete systems of global G-invariants of a path and a curve in E^2 are obtained. Existence and uniqueness theorems are given, and explicit forms of a path and a curve with given global invariants are obtained.
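For reference, the groups involved admit the following standard description (a routine fact, not taken from the paper itself):

```latex
\mathrm{LSim}(2)     = \{\, x \mapsto \lambda A x \;:\; \lambda > 0,\ A \in O(2) \,\}, \qquad
\mathrm{LSim}^{+}(2) = \{\, x \mapsto \lambda A x \;:\; \lambda > 0,\ A \in SO(2) \,\}.
% Two paths x_1, x_2 : I \to E^2 are globally G-equivalent iff
% x_2 = g \circ x_1 for some g \in G; a complete system of invariants
% separates exactly these orbits.
```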
Machine learning methods, such as neural networks (NN) and support vector machines, assume that the training data and the test data are drawn from the same distribution. This assumption may not hold in many real-world applications, such as long-term financial failure prediction, because the training and test data may come from different time periods or domains. This paper proposes a novel algorithm, fuzzy bridged refinement-based domain adaptation, to solve the long-term prediction problem. The algorithm utilizes fuzzy systems and similarity concepts to modify the labels of target instances that were initially predicted by a shift-unaware prediction model. Experiments are performed using three shift-unaware prediction models under nine settings covering two main situations: (1) no labeled instances are available in the target domain; (2) a few labeled instances are available in the target domain. Bank-failure financial data are used to validate the algorithm. The results demonstrate a significant improvement in predictive accuracy, particularly in the second situation.
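The refinement idea, stripped of the fuzzy machinery, can be sketched as a similarity-driven relabeling of target instances; the k-NN consensus below is a simplified stand-in offered under that assumption, with `probs_init` holding the shift-unaware model's class probabilities:

```python
import numpy as np

def refine_labels(X_tgt, probs_init, n_iter=10, k=5):
    """Repeatedly replace each target instance's class probabilities by
    the average over its k most similar target instances."""
    d = np.linalg.norm(X_tgt[:, None, :] - X_tgt[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)           # exclude self-neighbors
    knn = np.argsort(d, axis=1)[:, :k]    # indices of nearest neighbors
    probs = probs_init.copy()
    for _ in range(n_iter):
        probs = probs[knn].mean(axis=1)   # neighborhood consensus
    return probs.argmax(axis=1)           # refined hard labels
```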
Metric learning is a critical problem in classification. Most classifiers are based on a metric; the simplest is the KNN classifier, whose outcome is directly decided by the given metric. This paper discusses semi-supervised metric learning. Most traditional semi-supervised metric learning algorithms preserve the local structure of all the samples (labeled and unlabeled) in the input space while pulling samples with the same label together and pushing samples with different labels apart. In most existing methods, the local structure is calculated from the Euclidean distance using all the features. However, high-dimensional data typically lie on a low-dimensional manifold, and not all features are discriminative. In this paper, we therefore explore the latent structure of the samples and use the more discriminative features to calculate the local structure. The latent structure is learned by a clustering random forest and cast into similarities between samples; based on the hierarchical structure of the trees and the split functions, the similarity is obtained from discriminant features. Experimental results on public data sets show that our algorithm outperforms comparable traditional algorithms.
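One standard way to realize a clustering random forest, sketched here as an assumption about the construction (it follows Breiman's real-versus-synthetic trick rather than this paper's specific variant), is to discriminate the data from a column-permuted synthetic copy and read similarity off leaf co-occurrence:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def clustering_forest_similarity(X, n_trees=100, random_state=0):
    """Similarity from an unsupervised (clustering) random forest."""
    rng = np.random.default_rng(random_state)
    # Synthetic sample: each feature permuted independently, which
    # destroys inter-feature structure while keeping the marginals.
    X_syn = np.column_stack([rng.permutation(col) for col in X.T])
    X_all = np.vstack([X, X_syn])
    y_all = np.concatenate([np.ones(len(X)), np.zeros(len(X_syn))])
    rf = RandomForestClassifier(n_estimators=n_trees,
                                random_state=random_state).fit(X_all, y_all)
    leaves = rf.apply(X)  # leaf ids of the real data, (n, n_trees)
    # Similarity = fraction of trees in which two samples share a leaf.
    return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
```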
The paper presents a contribution to the minimization of fuzzy automata. Traditionally, the problem of minimizing fuzzy automata derives directly from the problem of minimizing ordinary automata: given a fuzzy automaton, describe an automaton with the minimal number of states that recognizes the same language. In this paper, we formulate a different problem: the minimal fuzzy automaton we look for is required to recognize a language that is similar to the language of the given fuzzy automaton to a certain degree a, such as a = 0.9, prescribed by a user. That is, we relax the condition of being equal to the weaker condition of being similar to degree a.
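One common way in fuzzy set theory to make "similar to degree a" precise, recorded here as an assumption since the abstract does not fix the definition, grades the agreement of two fuzzy languages via the biresiduum of the underlying t-norm:

```latex
S(L_1, L_2) \;=\; \inf_{w \in \Sigma^*} \bigl( L_1(w) \leftrightarrow L_2(w) \bigr),
% and the relaxed minimization problem asks for a fuzzy automaton with
% the fewest states whose language L' satisfies S(L, L') \ge a,
% e.g. a = 0.9.
```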
Texture image segmentation is an essential first step of low-level vision, and the extraction of texture features is one of its most fundamental problems. Many methods for extracting texture features have been proposed, such as statistical features, co-occurrence features, two-dimensional autoregressive (AR) features, and fractal-based features. In this paper, a new method for extracting texture features is proposed: the gray-scale image is first decomposed into a series of binary images by variable thresholds, and topological features of all these binary images are then computed. Using these topological features as texture features, we apply pyramid linking with band-pass-filter neural networks to segment the texture image into homogeneous areas. Several experiments on synthetic texture images verify the efficacy of the new method.
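The decomposition-plus-topology step can be sketched as follows; the Euler number is used here as a representative topological feature, which is an assumption, since the abstract does not name the exact features computed:

```python
import numpy as np
from skimage.measure import euler_number

def topological_texture_features(gray, thresholds):
    """Decompose a gray-scale image into binary images by variable
    thresholds and take a topological quantity (here the Euler number,
    i.e. connected components minus holes) of each as a feature."""
    return np.array([euler_number(gray >= t) for t in thresholds])

# Usage sketch: feats = topological_texture_features(img, range(16, 256, 16))
```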