Scene recognition is an important computer vision task that has evolved from the study of the biological visual system. Its applications range from video surveillance and autopilot systems to robotics. Early works were based on feature engineering, involving the computation and aggregation of global and local image descriptors. Several popular image features, such as SIFT, SURF, HOG, ORB, LBP, and KAZE, have been proposed and applied to the task with successful results.
Features can either be computed from the entire image on a global scale, or extracted from local sub-regions and aggregated across the image. Suitable classifier models are then trained to classify these features. This review paper analyzes several of the handcrafted features that have been applied to the scene recognition task over the past decades, and tracks the transition from traditional feature engineering to deep learning, which forms the current state of the art in computer vision. Deep learning is now deemed to have overtaken feature engineering in several computer vision applications, and deep convolutional neural networks and vision transformers are the current state of the art for object recognition. However, scenes from urban landscapes are bound to contain similar objects, posing a challenge to deep learning solutions for scene recognition. In our study, a critical analysis of feature engineering and deep learning methodologies for scene recognition is provided, and results on benchmark scene datasets are presented, concluding with a discussion of challenges and possible solutions that may facilitate more accurate scene recognition.
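As a concrete illustration of the feature-engineering pipeline summarized above, the minimal sketch below aggregates ORB local descriptors into a bag-of-visual-words histogram and trains a linear SVM. The dataset layout, vocabulary size, and parameter choices are illustrative assumptions, not the setup of any specific paper reviewed here.

```python
# Minimal sketch of a handcrafted-feature scene classifier:
# ORB local descriptors -> k-means codebook -> bag-of-visual-words -> linear SVM.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

orb = cv2.ORB_create(nfeatures=500)

def local_descriptors(image_path):
    """Extract ORB descriptors from one grayscale image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = orb.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 32), np.uint8)

def bow_histogram(desc, codebook):
    """Aggregate local descriptors into a normalized codeword histogram."""
    words = codebook.predict(desc.astype(np.float32))
    hist, _ = np.histogram(words, bins=np.arange(codebook.n_clusters + 1))
    return hist / max(hist.sum(), 1)

def train_scene_classifier(image_paths, labels, vocab_size=256):
    """Build the visual vocabulary, encode each image, and fit a linear classifier."""
    all_desc = np.vstack([local_descriptors(p) for p in image_paths]).astype(np.float32)
    codebook = KMeans(n_clusters=vocab_size, n_init=4).fit(all_desc)
    X = np.array([bow_histogram(local_descriptors(p), codebook) for p in image_paths])
    clf = LinearSVC().fit(X, labels)
    return codebook, clf
```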
This work presents a flexible framework for recognizing 3D objects from 2D views. A similarity-based aspect graph, which contains a set of aspects and prototypes for those aspects, is employed to represent the database of 3D objects. At the core of the framework is an incremental database construction method that maximizes the similarity of views within the same aspect while minimizing the similarity between prototypes, building and updating the aspect graph from 2D views randomly sampled from a viewing sphere. The proposed framework is evaluated on various recognition problems, including 3D object recognition, human posture recognition, and scene recognition. Shape and color features are employed in the different applications, and the top-three matching rates demonstrate the effectiveness of the proposed method.
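A hypothetical sketch of one plausible reading of incremental aspect-graph construction is given below: each new 2D view is assigned to its most similar aspect (or spawns a new one), and each aspect keeps the view closest to all of its views as its prototype. The similarity function, threshold, and prototype update rule are assumptions, not the paper's exact method.

```python
# Hypothetical incremental aspect-graph builder over view feature vectors.
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

class AspectGraph:
    def __init__(self, new_aspect_threshold=0.8):
        self.aspects = []          # list of lists of view feature vectors
        self.prototypes = []       # one representative feature per aspect
        self.threshold = new_aspect_threshold

    def add_view(self, feature):
        """Insert one sampled view; grow or update the aspect set."""
        if self.prototypes:
            sims = [cosine_sim(feature, p) for p in self.prototypes]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                self.aspects[best].append(feature)
                self._update_prototype(best)
                return best
        self.aspects.append([feature])
        self.prototypes.append(feature)
        return len(self.aspects) - 1

    def _update_prototype(self, idx):
        """Prototype = the view with the highest mean similarity to its aspect."""
        views = self.aspects[idx]
        scores = [np.mean([cosine_sim(v, w) for w in views]) for v in views]
        self.prototypes[idx] = views[int(np.argmax(scores))]
```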
Labeling images is tedious and costly work that is required for many applications, such as tagging, grouping, and exploring image collections. It is also necessary for training visual classifiers that recognize scenes or objects. It is therefore desirable either to reduce the human effort or to infer additional knowledge by addressing this task with algorithms that learn image annotations in a semi-supervised manner. In this paper, a semi-supervised annotation learning algorithm is introduced that is based on partitioning the data in a multi-view approach. The method is applied to large, diverse collections of natural scene images. Experiments are performed on the 15 Scenes and SUN databases. It is shown that, for sparsely labeled datasets, the proposed annotation learning algorithm is able to infer additional knowledge from the unlabeled samples and therefore improve the performance of visual classifiers compared to supervised learning. Furthermore, the proposed algorithm outperforms other related semi-supervised learning approaches.
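One plausible reading of multi-view semi-supervised annotation learning is a co-training-style scheme, sketched below: two classifiers trained on different feature views of the labeled images exchange their most confident pseudo-labels on the unlabeled pool. The choice of views, confidence threshold, and number of rounds are illustrative assumptions, not the exact algorithm of the paper.

```python
# Co-training-style sketch over two feature views (Xa_*, Xb_*) of the same images.
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(Xa_l, Xb_l, y_l, Xa_u, Xb_u, rounds=5, conf=0.9):
    """Xa_*/Xb_* are two feature views of labeled (l) and unlabeled (u) images."""
    clf_a = LogisticRegression(max_iter=1000)
    clf_b = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf_a.fit(Xa_l, y_l)
        clf_b.fit(Xb_l, y_l)
        if len(Xa_u) == 0:
            break
        # Average the class probabilities of both views on the unlabeled pool.
        proba = (clf_a.predict_proba(Xa_u) + clf_b.predict_proba(Xb_u)) / 2
        confident = proba.max(axis=1) >= conf
        if not confident.any():
            break
        pseudo = clf_a.classes_[proba[confident].argmax(axis=1)]
        # Move the confidently pseudo-labeled samples into the labeled pool.
        Xa_l = np.vstack([Xa_l, Xa_u[confident]])
        Xb_l = np.vstack([Xb_l, Xb_u[confident]])
        y_l = np.concatenate([y_l, pseudo])
        Xa_u, Xb_u = Xa_u[~confident], Xb_u[~confident]
    return clf_a, clf_b
```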
Aiming to obtain more discriminative features from scene images and to overcome the impact of intra-class differences and inter-class similarities, this paper proposes a scene recognition method that combines attention and context information. First, we introduce the attention mechanism and build a multi-scale attention model, in which discriminative information from salient objects and regions is captured by means of channel attention and spatial attention. In addition, a joint supervision strategy with a center loss function is introduced to further reduce misjudgments caused by intra-class differences. Second, a model based on multi-level context information is proposed to describe the positional relationships between objects, which effectively alleviates the influence of inter-class object similarity. Finally, the two models are fused so that the final feature representation not only focuses on the effective discriminative information but also encodes the relative positions of salient objects. Extensive experiments show that the proposed method effectively addresses insufficient feature representation in scene recognition tasks and improves recognition accuracy.
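The minimal PyTorch sketch below shows channel and spatial attention applied to a CNN feature map, in the general spirit of the attention model described above; the exact multi-scale design and the center-loss supervision of the paper are not reproduced here, and the module layout is an assumption.

```python
# Channel attention followed by spatial attention over a (B, C, H, W) feature map.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))        # global average pooling per channel
        return x * w.unsqueeze(-1).unsqueeze(-1)

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)      # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask

class AttentionBlock(nn.Module):
    """Channel attention followed by spatial attention over one feature map."""
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```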
In recent decades, various techniques based on deep convolutional neural networks (DCNNs) have been applied to scene classification. Most of these techniques rely on single-spectrum images, so environmental conditions may greatly affect the quality of images in the visible (RGB) spectrum. One remedy for this downside is to merge the infrared (IR) with the visible spectrum to gain complementary information compared with unimodal analysis. This paper incorporates RGB, IR, and near-infrared (NIR) images into a multispectral analysis for scene classification. For this purpose, two strategies are adopted. In the first strategy, each RGB, IR, and NIR image is separately applied to a DCNN and then classified according to the output score of each network; an optimal decision threshold is also obtained from these output scores. In the second strategy, three image components are extracted from each type of image using wavelet transform decomposition, and independent DCNNs are trained on the image components of all the scene classes. Eventually, the final classification of the scene is accomplished through an appropriate ensemble architecture. The use of this architecture, alongside a transfer learning approach and simple classifiers, leads to lower computational costs on small datasets. The experiments reveal the superiority of the proposed method over state-of-the-art architectures in terms of scene classification accuracy.
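Two ingredients of the description above are sketched below: a single-level 2D wavelet decomposition of each spectral image and a simple score-level ensemble of per-modality network outputs. The choice of wavelet, which sub-bands serve as the three components, and the fusion weights are assumptions, not the paper's exact configuration.

```python
# Wavelet sub-band extraction per modality, plus a weighted score-level ensemble.
import numpy as np
import pywt

def wavelet_components(image, wavelet="haar"):
    """Single-level 2D DWT: returns approximation and detail sub-bands."""
    cA, (cH, cV, cD) = pywt.dwt2(image.astype(np.float32), wavelet)
    return cA, cH, cV  # e.g. three components fed to independent DCNNs

def ensemble_scores(score_rgb, score_ir, score_nir, weights=(1.0, 1.0, 1.0)):
    """Weighted average of per-modality class-score vectors; argmax = final class."""
    scores = np.average([score_rgb, score_ir, score_nir], axis=0, weights=weights)
    return int(np.argmax(scores)), scores
```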
Image classification is a persistent task in robotics, automobiles, and machine vision. Scene categorization remains one of the challenging parts of various multimedia technologies involved in human–computer interaction, robotic navigation, video surveillance, medical diagnosis, tourist guidance, and drone targeting. In this research, a Hybrid Mayfly Lévy flight distribution (MLFD) optimization algorithm-tuned deep convolutional neural network is proposed to classify images effectively. Feature extraction is a significant step, as it enhances classifier performance by reducing execution time and computational complexity. Further, the classifier is optimally trained by the Hybrid MLFD algorithm, which in turn reduces optimization issues. The accuracy of the proposed MLFD-based Deep-CNN on the SCID-2 dataset is 95.2683% at 80% training and 97.6425% for 10-fold cross-validation. This demonstrates that the proposed MLFD-based Deep-CNN outperforms the conventional methods in terms of accuracy, sensitivity, and specificity.
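For context, the sketch below shows the standard Lévy-flight perturbation (Mantegna's algorithm) that metaheuristics of this kind commonly use to explore a search space; the hybrid mayfly update rules and the specific network hyperparameters being tuned are not reproduced here and the step-scale value is an assumption.

```python
# Levy-flight step generator often used inside metaheuristic optimizers.
import numpy as np
from math import gamma, sin, pi

def levy_step(dim, beta=1.5, rng=None):
    """Draw one Levy-distributed step vector of length `dim` (Mantegna's algorithm)."""
    rng = rng or np.random.default_rng()
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)

def levy_update(candidate, best, step_scale=0.01, beta=1.5):
    """Nudge a candidate solution vector relative to the best solution found so far."""
    candidate, best = np.asarray(candidate), np.asarray(best)
    return candidate + step_scale * levy_step(len(candidate), beta) * (candidate - best)
```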
In this paper, we devise a system that takes the mobile experience for users to a whole new level, through what we call the Mobile Tourguide System. The system allows a user to find places of interest (e.g., cinemas, restaurants) using a camera-equipped mobile phone. Several challenges are addressed in the paper, the most notable of which are the construction of a database for efficient image retrieval, and accuracy in wide-baseline matching and pose recovery. The system is validated by experiments on our campus, with images taken under different viewpoints, resolutions, and weather and illumination conditions.
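The sketch below illustrates the wide-baseline matching and relative-pose recovery step using standard OpenCV primitives; the feature type, matcher, ratio threshold, and camera intrinsics K are assumptions rather than the system's actual configuration.

```python
# Wide-baseline matching (SIFT + ratio test) and relative pose from the essential matrix.
import cv2
import numpy as np

def match_and_recover_pose(img_query, img_db, K):
    """Match two grayscale views and recover the relative rotation and translation."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_query, None)
    kp2, des2 = sift.detectAndCompute(img_db, None)

    # Lowe's ratio test on k-NN matches keeps only distinctive correspondences.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < 0.75 * n.distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # RANSAC essential-matrix estimation, then decompose into rotation R and translation t.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t, int(mask.sum())
```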