Unsupervised image clustering is a challenging task in computer vision. Recently, various deep clustering algorithms based on contrastive learning have achieved promising performance, obtaining distinguishable feature representations simply by taking different augmented views of the same image as positive pairs and maximizing their similarity, while taking augmentations of other images in the same batch as negative pairs and minimizing their similarity. However, because more than one image in a batch may belong to the same class, simply pushing the negative instances apart causes inter-class conflicts and degrades clustering performance. To solve this problem, we propose a deep clustering algorithm based on supported nearest neighbors (SNDC), which constructs positive pairs for the current image by maintaining a support set and finding its k nearest neighbors in that set. By going beyond single-instance positives, SNDC learns more generalized feature representations with inherent semantic meaning, thereby alleviating inter-class conflicts. Experimental results on multiple benchmark datasets show that SNDC outperforms state-of-the-art clustering models, with accuracy improvements of 6.2% and 20.5% on CIFAR-10 and ImageNet-Dogs, respectively.
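To make the support-set idea concrete, here is a minimal PyTorch sketch of a nearest-neighbor contrastive loss in the spirit of SNDC; the function name, the averaging of neighbors, and the support-queue management (omitted) are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def snn_contrastive_loss(z1, z2, support, k=3, tau=0.1):
        """z1, z2: (B, D) embeddings of two augmented views; support: (S, D) queue."""
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        support = F.normalize(support, dim=1)
        # Replace the usual single-instance positive with the k nearest
        # neighbours of view 1 found in the support set.
        nn_idx = (z1 @ support.T).topk(k, dim=1).indices          # (B, k)
        nn_pos = F.normalize(support[nn_idx].mean(dim=1), dim=1)  # (B, D)
        logits = nn_pos @ z2.T / tau                              # (B, B)
        labels = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, labels)

In practice the support set would be refreshed with recent embeddings, e.g., a first-in-first-out queue, so the neighbors track the evolving feature space.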
Zero-shot detection (ZSD) aims to locate and classify unseen objects in images or videos using auxiliary semantic information, without additional training examples. Most existing ZSD methods are based on two-stage models, which detect unseen classes by aligning object region proposals with semantic embeddings. However, these methods have several limitations: poor region proposals for unseen classes, lack of consideration of the semantic representations of unseen classes or their inter-class correlations, and domain bias towards seen classes, all of which can degrade overall performance. To address these issues, the Trans-ZSD framework is proposed: a transformer-based multi-scale contextual detection framework that explicitly exploits inter-class correlations between seen and unseen classes and optimizes the feature distribution to learn discriminative features. Trans-ZSD is a single-stage approach that skips proposal generation and performs detection directly, encoding long-range dependencies at multiple scales to learn contextual features while requiring fewer inductive biases. Trans-ZSD also introduces a foreground–background separation branch to alleviate confusion between unseen classes and background, contrastive learning to capture inter-class uniqueness and reduce misclassification between similar classes, and explicit inter-class commonality learning to facilitate generalization between related classes. Finally, Trans-ZSD addresses the domain bias problem in end-to-end generalized zero-shot detection (GZSD) models by using a balance loss that maximizes response consistency between seen and unseen predictions, ensuring that the model does not bias towards seen classes. The Trans-ZSD framework is evaluated on the PASCAL VOC and MS COCO datasets, demonstrating significant improvements over existing ZSD models.
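One plausible reading of the balance-loss idea, sketched below in PyTorch, is a penalty on the gap between the strongest seen-class response and the strongest unseen-class response per box; this is a hedged illustration only, not the paper's exact formulation.

    import torch

    def balance_loss(scores_seen, scores_unseen):
        """scores_seen: (B, num_seen), scores_unseen: (B, num_unseen) class responses."""
        # Penalize boxes whose seen-class responses dominate unseen-class
        # responses, nudging the detector away from seen-class bias.
        gap = scores_seen.max(dim=1).values - scores_unseen.max(dim=1).values
        return gap.clamp(min=0).mean()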
Stroke patients are prone to fatigue during EEG acquisition, and such experiments place high cognitive and physical demands on subjects; learning effective feature representations is therefore very important. Deep learning networks have been widely used in motor imagery (MI) based brain-computer interfaces (BCIs). This paper proposes a contrastive predictive coding (CPC) framework based on the modified S-transform (MST) to generate MST-CPC feature representations. MST is used to acquire time-frequency features that improve decoding performance for MI task recognition, and EEG2Image converts multi-channel one-dimensional EEG into two-dimensional EEG topography. High-level feature representations are then generated by CPC, which consists of an encoder and an autoregressive model, and the effectiveness of the generated features is verified with the k-means clustering algorithm. The model generates features efficiently and yields a good clustering effect; in a classification evaluation over 40 subjects, the average accuracy on MI tasks is 89%. The proposed method thus obtains effective feature representations and improves the performance of MI-BCI systems. A comparison of several self-supervised methods on the public dataset shows that the MST-CPC model achieves the highest average accuracy. This is a step forward in combining self-supervised learning with image-based processing of EEG signals, and it can help provide effective rehabilitation training for stroke patients to promote motor function recovery.
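For readers unfamiliar with CPC, the following PyTorch sketch shows the generic encoder-plus-autoregressive-model structure trained with InfoNCE; the MST/EEG2Image front end is omitted, and all module sizes and names are illustrative assumptions rather than the MST-CPC code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CPC(nn.Module):
        """Generic CPC: encoder + autoregressive context model + InfoNCE."""
        def __init__(self, in_dim=64, z_dim=128, c_dim=128, pred_steps=4):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(in_dim, z_dim), nn.ReLU(),
                                         nn.Linear(z_dim, z_dim))
            self.ar = nn.GRU(z_dim, c_dim, batch_first=True)
            self.predictors = nn.ModuleList(
                nn.Linear(c_dim, z_dim) for _ in range(pred_steps))

        def infonce(self, x, t):
            """x: (B, T, in_dim) frame features; t: context cut-off (t + steps < T)."""
            z = self.encoder(x)                  # (B, T, z_dim) local features
            c, _ = self.ar(z[:, : t + 1])        # context summary up to time t
            ct = c[:, -1]                        # (B, c_dim)
            loss = 0.0
            for s, pred in enumerate(self.predictors, start=1):
                z_hat = pred(ct)                 # predict the step-s future
                logits = z_hat @ z[:, t + s].T   # other batch items are negatives
                labels = torch.arange(x.size(0), device=x.device)
                loss = loss + F.cross_entropy(logits, labels)
            return loss / len(self.predictors)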
Over the past few years, graph contrastive learning (GCL) has achieved great success in processing unlabeled graph-structured data, but most existing GCL methods are based on an instance-discrimination task that learns representations by minimizing the distance between two versions of the same instance. However, unlike images, which are assumed to be independently and identically distributed, graphs present relational information among data instances: each instance is related to others by links, and in many cases these relations are heterogeneous. The instance-discrimination task cannot make full use of the relational information inherent in graph-structured data. To solve these problems, this paper proposes a relation-aware graph contrastive learning method, called RGCL. Aiming to capture the most important heterogeneous relations in the graph, RGCL explicitly models the edges, then pulls semantically similar pairs of edges together and pushes dissimilar ones apart with a contrastive regularization. By exploiting the full potential of the relationships among nodes, RGCL overcomes the limitations of previous GCL methods based on instance discrimination. Experimental results demonstrate that the proposed method outperforms a series of graph contrastive learning frameworks on widely used benchmarks, which justifies the effectiveness of our work.
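A minimal sketch of edge-level contrast, assuming edges are represented by concatenating their endpoint node embeddings and that two augmented graph views yield matched edge sets; both choices are assumptions for illustration, not RGCL's exact design.

    import torch
    import torch.nn.functional as F

    def edge_embeddings(h, edge_index):
        """h: (N, D) node embeddings; edge_index: (2, E) source/target indices."""
        src, dst = edge_index
        return torch.cat([h[src], h[dst]], dim=1)      # (E, 2D) edge features

    def edge_contrastive_loss(e1, e2, tau=0.5):
        """e1, e2: (E, D') the same edges under two views; other edges are negatives."""
        e1 = F.normalize(e1, dim=1)
        e2 = F.normalize(e2, dim=1)
        logits = e1 @ e2.T / tau                       # (E, E) similarity matrix
        labels = torch.arange(e1.size(0), device=e1.device)
        return F.cross_entropy(logits, labels)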
The embedding of Knowledge Graphs (KGs) in hyperbolic space has recently received great attention in deep learning because it can represent hierarchical structures more accurately and concisely than Euclidean or complex spaces. Although hyperbolic embeddings show significant improvements over Euclidean and complex-space embeddings for KG embedding, they still face challenges from the uneven distribution and insufficient alignment of high-dimensional sparse data. To address this issue, we propose the CONHyperKGE model, which leverages contrastive learning to optimize the embedding distribution in hyperbolic space. This approach enables better capture of hierarchical structures, improved handling of symmetry, and better treatment of sparse matrices. Our method is evaluated on four standard KG Embedding (KGE) datasets: WN18RR, FB15k-237, Kinship, and UMLS. Extensive experiments show improved performance on all four datasets. Notably, on the low-dimensional Kinship dataset, our method achieves an average Mean Reciprocal Rank (MRR) improvement of 2% over the original method, while on the high-dimensional WN18RR dataset an average MRR improvement of 1% is observed.
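For orientation, the geometry underlying such embeddings is the Poincaré-ball distance; the sketch below (plain NumPy, curvature 1) gives the standard geodesic distance as generic background, not CONHyperKGE's specific scoring function.

    import numpy as np

    def poincare_distance(u, v, eps=1e-7):
        """Geodesic distance between points u, v inside the unit Poincare ball."""
        uu = np.dot(u, u)
        vv = np.dot(v, v)
        duv = np.dot(u - v, u - v)
        # d(u, v) = arccosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))
        arg = 1.0 + 2.0 * duv / max((1.0 - uu) * (1.0 - vv), eps)
        return np.arccosh(arg)

Because the denominator shrinks as points approach the boundary, distances grow rapidly there, which is what lets trees and other hierarchies embed with low distortion.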
Sentiment analysis on scene text images is complex and challenging because such images have arbitrary backgrounds and the method must rely on visual features alone. Unlike most existing methods, which use text, images, or both, this study uses only scene text images for sentiment analysis. The intuition is that users sometimes express their feelings and emotions, or convey their messages, by writing text in different shapes over diverse background designs; existing methods ignore these vital cues. This work explores a vision transformer to extract visual features that represent contextual information about the appearance of the text image. Further, to strengthen the visual features, the proposed work introduces contrastive learning, which maximizes the separation between classes and minimizes the variation within each of the positive, negative, and neutral classes. To demonstrate its effectiveness, the proposed method is tested on our own constructed dataset and on a benchmark dataset. A comparative study with existing methods shows that the proposed method is superior in classifying positive, negative, and neutral scene text images.
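The inter-/intra-class objective described here is what a supervised contrastive loss implements; the PyTorch sketch below is one standard formulation over the three sentiment classes, with names and the temperature value as illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def supervised_contrastive_loss(feats, labels, tau=0.07):
        """feats: (B, D) image embeddings; labels: (B,) in {0: pos, 1: neg, 2: neutral}."""
        feats = F.normalize(feats, dim=1)
        sim = feats @ feats.T / tau                      # (B, B) similarities
        self_mask = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
        pos = (labels[:, None] == labels[None, :]) & ~self_mask  # same-class pairs
        log_prob = sim - torch.logsumexp(
            sim.masked_fill(self_mask, -1e9), dim=1, keepdim=True)
        # average log-likelihood of each anchor's same-class positives
        return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()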
Few-Shot Class Incremental Learning (FSCIL) is a trending topic in deep learning, addressing the need for models to incrementally learn novel classes, particularly in real-world scenarios where continuously emerging classes come with limited labeled samples. However, the majority of FSCIL research has been dedicated to image classification and object recognition, with limited attention given to video action classification. In this paper, we present a new Cluster Compression and Generative Separation (CCGS) method for incremental Few-Shot Video Action Recognition (iFSVAR), which introduces contrastive learning to boost the degree of class separation in the base session. Simultaneously, it creates numerous fine-grained classes with diverse semantics, effectively filling the otherwise unallocated representation space. Experimental results on UCF101, Kinetics, and Something-Something-V2 demonstrate the effectiveness of the framework.
Automatic tracking of three-dimensional (3D) human motion poses can provide technical support in various fields. However, existing methods for tracking human motion poses suffer from significant errors, long tracking times, and suboptimal results. To address these issues, an automatic tracking method for 3D human motion poses based on contrastive learning is proposed. From the feature parameters of 3D human motion poses, threshold-variation parameters are computed; the golden section is introduced to transform these parameters, and pose features are extracted by comparing the feature parameters against the variation threshold. Under the supervision of contrastive learning, a constraint loss is added to the local–global deep supervision module to extract local parameters of the poses, which are combined with their local features. After normalizing the 3D human motion pose images, frame differences of the background image are calculated, and an automatic tracking model is constructed to achieve automatic tracking of 3D human motion poses. Experimental results show that the highest tracking lag is 9%, there is no deviation in node tracking, pixel contrast stays above 90%, and only 6 sub-blocks show detail loss. This indicates that the proposed method tracks 3D human motion poses effectively, covers all nodes, and achieves high accuracy with good tracking results.
In Natural Language Processing (NLP), one entity commonly contains another, i.e., entities are nested. However, the most widely used methods can handle only flat entities, not nested ones. To solve this problem, this paper proposes a flat-span contrastive learning method for nested Named Entity Recognition (NER), which consists of two sub-modules: a flat NER module and a candidate span classification module. The flat NER module recognizes the outermost entities: a star-transformer captures the long-range dependencies of sentences, a Conditional Random Field (CRF) decodes the outermost entity spans, and contrastive learning with the InfoNCE loss increases the difference between entity spans and non-entity spans. In the candidate span classification module, we enumerate all possible candidate spans based on the outermost entities to better distinguish entity spans from non-entity spans. Finally, to improve model performance and reduce error propagation, we jointly train the flat NER and candidate span classification modules through multi-task learning. Experimental results on the GENIA, GermEval2014, and JNLPBA datasets thoroughly verify the effectiveness of our model, and ablation experiments further demonstrate the effectiveness of its components.
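An illustrative InfoNCE loss over span representations, of the kind used here to separate entity spans from non-entity spans; how span vectors are pooled from token states and how pairs are sampled are assumptions for the sketch.

    import torch
    import torch.nn.functional as F

    def span_infonce(anchor, positive, negatives, tau=0.1):
        """anchor, positive: (D,) entity-span vectors; negatives: (M, D) non-entity spans."""
        anchor = F.normalize(anchor, dim=0)
        cands = F.normalize(torch.cat([positive[None], negatives]), dim=1)
        logits = (cands @ anchor / tau)[None]    # (1, M + 1); positive sits at index 0
        target = torch.zeros(1, dtype=torch.long, device=anchor.device)
        return F.cross_entropy(logits, target)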
Classic approaches to content moderation typically apply rule-based heuristics to flag content. While rules are easily customizable and intuitive for humans to interpret, they are inherently fragile and lack the flexibility and robustness needed to moderate the vast amount of undesirable content found online today. Recent advances in deep learning have demonstrated the promise of highly effective deep neural models for overcoming these challenges. However, despite their improved performance, these data-driven models lack transparency and explainability, often leading to mistrust from everyday users and a lack of adoption by many platforms. In this paper, we present Rule By Example (RBE): a novel exemplar-based contrastive learning approach for learning from logical rules for the task of textual content moderation. RBE provides rule-grounded predictions, allowing for more explainable and customizable predictions than typical deep learning-based approaches. We demonstrate that our approach can learn rich rule embedding representations using only a few data examples. Experimental results on three popular hate speech classification datasets show that RBE outperforms state-of-the-art deep learning classifiers as well as the use of rules alone, in both supervised and unsupervised settings, while providing explainable model predictions via rule-grounding.
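A hedged sketch of what rule-grounded prediction can look like at inference time: a text is flagged when its embedding lies close to some learned rule embedding, and the matched rule doubles as the explanation. The encoders, threshold, and function names are assumptions, not RBE's actual API.

    import torch
    import torch.nn.functional as F

    def rule_grounded_predict(text_emb, rule_embs, threshold=0.5):
        """text_emb: (D,) encoded text; rule_embs: (R, D) one embedding per rule."""
        sims = F.cosine_similarity(text_emb[None], rule_embs)   # (R,)
        best = sims.argmax()
        # Return the moderation decision plus the index of the grounding rule,
        # which serves as a human-readable explanation of the prediction.
        return (sims[best] > threshold).item(), best.item()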
Entity Alignment (EA) aims to identify pairs of entities in two knowledge graphs (KGs) that refer to the same real-world entity. Recently, entity embedding-based models have become the mainstream approach to EA, but they have the following shortcomings: (1) the ratio of seed alignments strongly affects EA performance, and acquiring seeds often requires substantial labor; and (2) entity embeddings do not account for the differences between entities. To address these problems, we propose an entity embedding-based model for EA between KGs built on contrastive learning that requires no pre-aligned seed entity pairs; it integrates entity attribute information into the entity embeddings and enhances the discrimination between the embeddings of different entities. Experimental results on two real-world knowledge bases show that our model achieves good improvements on the three common entity alignment metrics: hits@1, hits@10, and MR.
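For reference, the metrics named above are computed from the rank assigned to each ground-truth aligned entity (rank 1 is best); these are the standard definitions.

    def hits_at_k(ranks, k):
        """Fraction of ground-truth entities ranked within the top k."""
        return sum(r <= k for r in ranks) / len(ranks)

    def mean_rank(ranks):
        """Average rank of the ground-truth entities (lower is better)."""
        return sum(ranks) / len(ranks)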
Although protein sequence data is growing at an ever-increasing rate, the protein universe is still sparsely annotated with functional and structural information. Computational approaches have become efficient solutions for inferring annotations of unlabeled proteins by transferring knowledge from proteins with experimental annotations. Despite the increasing availability of protein structure data and the high coverage of high-quality predicted structures, e.g., by AlphaFold, many existing computational tools still rely only on sequence data to predict structural or functional annotations, including alignment algorithms such as BLAST and several sequence-based deep learning models. Here, we develop PenLight, a general deep learning framework for protein structural and functional annotation. PenLight uses a graph neural network (GNN) to integrate 3D protein structure data with protein language model representations. In addition, PenLight applies a contrastive learning strategy to train the GNN to learn protein representations that reflect similarities beyond sequence identity, such as semantic similarities in function or structure space. We benchmarked PenLight on a structural classification task and a functional annotation task, where it achieved higher prediction accuracy and coverage than state-of-the-art methods.
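A minimal sketch of the kind of structure/sequence integration described: per-residue language-model embeddings serve as node features on a contact graph derived from the 3D structure, processed by a simple mean-aggregation GNN layer. The layer below is library-free and generic; dimensions and names are assumptions, not PenLight's architecture.

    import torch
    import torch.nn as nn

    class SimpleGNNLayer(nn.Module):
        """One mean-aggregation message-passing layer over a residue contact graph."""
        def __init__(self, d):
            super().__init__()
            self.lin = nn.Linear(2 * d, d)

        def forward(self, h, adj):
            """h: (N, d) residue features (e.g., language-model embeddings);
            adj: (N, N) 0/1 contact matrix from the 3D structure."""
            neigh = adj @ h / adj.sum(1, keepdim=True).clamp(min=1)  # mean over contacts
            return torch.relu(self.lin(torch.cat([h, neigh], dim=1)))

Contrastive training would then treat proteins sharing a structural or functional label as positives, so the pooled graph representations cluster by function rather than raw sequence identity.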
Next Generation Sequencing has given us access to vast amounts of multi-omics data. However, this data is challenging to analyse because of its high dimensionality and because much of it is unannotated. The lack of annotated data is a significant problem in machine learning, and Self-Supervised Learning (SSL) methods are typically used to deal with limited labelled data. However, few studies use SSL methods to exploit inter-omics relationships in unlabelled multi-omics data. In this work, we develop a novel and efficient pre-training paradigm consisting of several SSL components, including contrastive alignment, data recovery from corrupted samples, and using one omic type to recover other omic types. Our pre-training paradigm improves performance on downstream tasks with limited labelled data: our approach outperforms the state-of-the-art method in cancer type classification on the TCGA pan-cancer dataset in a semi-supervised setting. Moreover, encoders pre-trained with our approach can be used as powerful feature extractors even without fine-tuning. An ablation study shows that the method is not overly dependent on any single pretext task component. The network architectures in our approach are designed to handle missing omic types and multiple datasets for both pre-training and downstream training, and the pre-training paradigm can be extended to perform zero-shot classification of rare cancers.
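A minimal sketch of two of the pretext components named above, contrastive alignment between two omic views of the same sample and cross-omics recovery; the encoder/decoder shapes, omic types, and loss weighting are assumptions for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class OmicsPretrainer(nn.Module):
        """Toy two-omics pre-trainer: align paired samples, recover one omic from another."""
        def __init__(self, dim_rna=1000, dim_meth=500, d=128):
            super().__init__()
            self.enc_rna = nn.Linear(dim_rna, d)
            self.enc_meth = nn.Linear(dim_meth, d)
            self.rna_to_meth = nn.Linear(d, dim_meth)   # cross-omics recovery head

        def forward(self, x_rna, x_meth, tau=0.1):
            z_r = F.normalize(self.enc_rna(x_rna), dim=1)
            z_m = F.normalize(self.enc_meth(x_meth), dim=1)
            logits = z_r @ z_m.T / tau                  # paired samples are positives
            labels = torch.arange(x_rna.size(0), device=x_rna.device)
            align = 0.5 * (F.cross_entropy(logits, labels) +
                           F.cross_entropy(logits.T, labels))
            recover = F.mse_loss(self.rna_to_meth(self.enc_rna(x_rna)), x_meth)
            return align + recover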