This paper proposes a Unified Monocular-Stereo Visual-Inertial State Estimator (UMS-VINS) that combines monocular vision, stereo vision, and inertial measurements for vehicle localization in degenerate scenarios. UMS-VINS is a tightly coupled visual-inertial odometry (VIO) system that requires a stereo camera and a low-cost inertial measurement unit (IMU). On the one hand, we introduce additional two-dimensional sub-pixel features from the left and/or right cameras. With these monocular-stereo features, UMS-VINS improves positioning accuracy and robustness by enhancing both the quality and quantity of features. On the other hand, a mode-selection-based visual-inertial initialization strategy is designed to dynamically choose between stereo visual odometry and VIO according to the inertial motion state and initialization status, which guarantees successful initialization. The performance on both new real-world and public datasets demonstrates the method's effectiveness in terms of localization accuracy, localization robustness, and environmental adaptability.
In this paper, we present an accurate and robust pose estimator for rigid, polyhedral objects, based on Artificial Neural Networks (ANNs) and suitable for Automated Visual Inspection (AVI) applications. The estimator is novel in that it is trained with different poses of objects whose dimensional deviations lie within the tolerance range, and it is therefore robust to within-tolerance dimensional errors. The estimation accuracy is scalable, and our computer simulation experiments with the existing ANN configurations have shown an accuracy better than 4% of the placement error. The ANN-based pose estimator offers several advantages over classical implementations.
In this paper, we propose a genetic algorithm (GA) based approach to determine the pose of an object with three degrees of freedom in Automated Visual Inspection. We have investigated the effect of noise at 20 dB SNR, as well as of mismatches resulting from incorrect correspondences between object-space points and image-space points, on the estimation of the pose parameters. The maximum error in the translation parameters is less than 0.45 cm and the rotational error is less than 0.2 degree at 20 dB SNR. The error in parameter estimation is insignificant for up to 7 pairs of mismatched points out of 24 object-space points, and it rises sharply when 8 or more pairs of points are mismatched. We have compared our results with those obtained by the least-squares technique, and the comparison shows that the GA-based method outperforms the gradient-based technique when the number of vertices of the object to be inspected is small. These results clearly establish the robustness of GA in estimating the pose of an object with a small number of vertices in automated visual inspection.
In this paper we consider the problem of object recognition and localization in a probabilistic framework. An object is represented by a parametric probability density, and the computation of pose parameters is implemented as a nonlinear parameter estimation problem. The presence of a probabilistic model allows for recognition according to Bayes' rule. The introduced probabilistic model requires no prior segmentation but characterizes the statistical properties of observed intensity values in the image plane. A detailed discussion of the applied theoretical framework is followed by a concise experimental evaluation which demonstrates the benefit of the proposed approach.
This paper presents a combined model-based 3D object recognition method motivated by the robust properties of human vision. The human visual system (HVS) is very efficient and robust in identifying and grabbing objects, in part because of its properties of visual attention, contrast mechanism, feature binding, multiresolution and part-based representation. In addition, the HVS combines bottom-up and top-down information effectively using combined model representation. We propose a method for integrating these aspects under a Monte Carlo method. In this scheme, object recognition is regarded as a parameter optimization problem. The bottom-up process initializes parameters, and the top-down process optimizes them. Experimental results show that the proposed recognition model is feasible for 3D object identification and pose estimation.
Accurate measurement of pose and expression can increase the efficiency of recognition systems by avoiding the recognition of spurious faces. This paper presents a novel and robust pose- and expression-invariant face recognition method that improves on existing face recognition techniques. First, we apply the TSL color model to detect the facial region and estimate the X-Y-Z pose vector of the face using connected-components analysis. Second, the input face is mapped onto a deformable 3D facial model. Third, the mapped face is transformed to the frontal face appropriate for face recognition, using the estimated pose vector and the expression action unit. Finally, the regions damaged during normalization are reconstructed using PCA. Several empirical tests validate the face detection model and the method for estimating facial pose and expression. In addition, the tests suggest that the recognition rate is greatly boosted by normalizing pose and expression.
In this paper we present an improved color-based planar fiducial marker system. Our framework provides precise and robust full 3D pose estimation of markers, with superior accuracy compared with many fiducial systems in the literature, while color information encoding enables the use of over 65,000 distinct markers. Unlike most color-based fiducial frameworks, which require prior classification training and color calibration, ours performs reliably under illumination changes, requiring only a rough white-balance adjustment. Our methodology provides good detection performance even under the poor illumination conditions that typically compromise other marker identification techniques, thus avoiding the evaluation of otherwise falsely identified markers. Several experiments are presented and carefully analyzed in order to validate our system and demonstrate the significant improvement in estimation accuracy of both position and orientation over traditional techniques.
The number of existing person re-identification datasets is limited, and each dataset exhibits variations such as illumination, background occlusion, and pose, which makes it difficult for existing methods to learn robust feature representations and leads to a decline in recognition performance. To address these problems, a person re-identification method combining style and pose generation is proposed in this paper. First, to handle the style differences between images collected by different cameras, a style transformation method based on generative adversarial networks is introduced into the person re-identification model, and cycle-consistent generative adversarial networks (CycleGAN) are used to perform style transfer and reduce the influence of camera differences. Second, since identity-sensitive information is easily lost when pedestrian pose changes greatly, AlphaPose is introduced to perform pose estimation. Combining style and pose for the first time, an improved deep convolutional generative adversarial network (DCGAN) structure enriches the input sample information and generates images with a unified style and pose, and training the person re-identification network with the new synthetic data improves its recognition performance. Finally, a random erasing method is further introduced during data augmentation to reduce overfitting, improve the generalization ability of the network, and handle partial occlusion. The experimental results show that the proposed method outperforms typical style-based or pose-based methods. The rank-1 accuracy and mAP on the Market-1501 dataset are 90.4% and 74.5%, which are 2.28% and 5.78% higher, respectively. To a certain extent, the performance of person re-identification is improved.
When a top-down method is applied to human pose estimation, the accuracy of joint localization is often limited by the accuracy of human detection. In addition, conventional algorithms commonly encode the image into a heat map before processing, and the systematic error introduced when decoding the heat map back to the original image affects the localization. To address these two problems, we propose an algorithm that uses multiple angle models to generate the human boxes and then performs lightweight decoding to recover the image: the new boxes fit humans better and the recovery error is reduced. First, we split the backbone network into three sub-networks: the first sub-network generates the original human box, the second sub-network generates a coarse pose estimate inside the boxes, and the third sub-network produces a high-precision pose estimate. To make the human box fit the human body better, with only a small number of interfering pixels inside the box, human boxes with multiple rotation angles are generated, and the results of the second sub-network are used to select the best box. Using this box as input to the third sub-network significantly improves the accuracy of the pose estimation. Then, to reduce the errors arising from image decoding, we propose a lightweight unbiased decoding strategy that differs from traditional methods by combining multiple possible offsets to select the direction and size of the final offset. On the MPII and COCO datasets, we compare the proposed algorithm with 11 state-of-the-art algorithms. The experimental results show that the algorithm achieves a large improvement in accuracy over a wide range of image sizes and metrics.
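The offset-based decoding idea can be pictured with a small sketch. The NumPy function below is a generic Newton-style sub-pixel refinement around the heat-map peak, shown only to illustrate decoding that combines neighbouring responses rather than applying a fixed shift; it is not the authors' lightweight unbiased decoder, and the function name and clipping range are assumptions.

```python
import numpy as np

def decode_keypoint(heatmap):
    """Return a sub-pixel (x, y) keypoint estimate in heat-map coordinates."""
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    dx = dy = 0.0
    # Combine the responses around the peak to choose the direction and size
    # of the final offset, instead of a fixed quarter-pixel shift.
    if 0 < x < w - 1:
        d1 = 0.5 * (heatmap[y, x + 1] - heatmap[y, x - 1])                 # 1st derivative in x
        d2 = heatmap[y, x + 1] - 2.0 * heatmap[y, x] + heatmap[y, x - 1]   # 2nd derivative in x
        if d2 < 0:                                                         # genuine local maximum
            dx = float(np.clip(-d1 / d2, -0.5, 0.5))
    if 0 < y < h - 1:
        d1 = 0.5 * (heatmap[y + 1, x] - heatmap[y - 1, x])
        d2 = heatmap[y + 1, x] - 2.0 * heatmap[y, x] + heatmap[y - 1, x]
        if d2 < 0:
            dy = float(np.clip(-d1 / d2, -0.5, 0.5))
    return float(x) + dx, float(y) + dy
```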
Existing linear solutions for the pose estimation (or exterior orientation) problem suffer from a lack of robustness and accuracy partially due to the fact that the majority of the methods utilize only one type of geometric entity and their frameworks do not allow simultaneous use of different types of features. Furthermore, the orthonormality constraints are weakly enforced or not enforced at all. We have developed a new analytic linear least-squares framework for determining pose from multiple types of geometric features. The technique utilizes correspondences between points, between lines and between ellipse–circle pairs. The redundancy provided by different geometric features improves the robustness and accuracy of the least-squares solution. A novel way of approximately imposing orthonormality constraints on the sought rotation matrix within the linear framework is presented. Results from experimental evaluation of the new technique using both synthetic data and real images reveal its improved robustness and accuracy over existing direct methods.
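As a concrete illustration of the orthonormality issue raised above, the NumPy snippet below projects an unconstrained 3x3 estimate onto the nearest proper rotation via SVD. This is the standard Procrustes-style correction often applied after a linear solve, given here for context only; it is not the paper's specific scheme for imposing the constraints within the linear framework.

```python
import numpy as np

def nearest_rotation(M):
    """Closest proper rotation (in Frobenius norm) to an arbitrary 3x3 matrix M."""
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:     # enforce det(R) = +1 (rotation, not reflection)
        U[:, -1] *= -1
        R = U @ Vt
    return R
```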
In this paper, a software toolchain is presented for the fully automatic alignment of a 3D human face model. Starting from a point cloud of a human head (previously segmented from its background), pose normalization is obtained using an innovative, purely geometrical approach. To solve for the six degrees of freedom of this problem, we first exploit the human face's natural mirror symmetry; second, we analyze the shape of the frontal profile; and finally, we align the model's bounding box according to the position of the tip of the nose. The whole procedure is cast as a two-fold, multivariable optimization problem addressed with multi-level genetic algorithms and a greedy search stage, the latter being compared against standard PCA. Experiments were conducted on the GavabDB database and included appropriate preprocessing stages for noise filtering and head model reconstruction. The results demonstrate the validity of this approach, albeit at the price of high computational complexity.
In this paper, two different methodologies, based respectively on an unsupervised self-organizing map (SOM) neural network and on graph matching, are presented and discussed to validate the performance of a new 3D facial feature identification and localization algorithm. Experiments are performed on a dataset of 23 3D faces acquired by a 3D laser camera at the eBIS lab, with pose and expression variations. In particular, the results for five nose landmarks are encouraging and reveal the validity of the approach, which, despite its low computational complexity and the small number of landmarks, guarantees an average face recognition performance greater than 80%.
Template warping is a widely used approach in the pose recovery field. Due to its flexibility, template warping is especially useful in monocular video motion tracking, and tracking accuracy is largely determined by the appearance of the 3D model. However, accurate tracking results are difficult to obtain because hand-estimated 3D models are prone to inaccuracies: the appearance of the 3D model is commonly initialized by the user from the first frame of video footage, yielding a hand-estimated model. To overcome this problem, we propose an iterative optimization process that estimates the appearance of the 3D model during tracking. Our results show that the proposed 3D model estimation achieves better performance than hand-initialized model estimation.
Face recognition and head pose estimation have attracted considerable academic interest recently, since the inherent information improves the performance of face-related applications such as face alignment, augmented reality, healthcare applications, and emotion detection. The proposed work explores the challenges of identifying people and determining head pose. An analysis of the features produced at intermediate layers when limiting the number of kernels is performed, which improves person detection performance. In addition, the learned features are fed to a forest of decision trees in order to determine the exact head pose of the detected person. The proposed Forest CNN (FCNN) architecture is evaluated against head pose estimation methods on the Pointing 04 and Facepix benchmark databases.
Surface registration, the estimation of a rigid transformation (pose) that aligns a model provided as a triangulated mesh with a set of discrete points (range data) sampled from the actual object, is a core task in computer vision. This paper refines and explores the previously introduced notion of Continuum Shape Constraint Analysis (CSCA), which allows the assessment of object shape toward predicting the performance of surface registration algorithms. Conceived for computer-vision-assisted spacecraft rendezvous analysis, the approach was developed for blanket or localized scanning by LIDAR or a similar range-finding scanner that samples non-specific points from the object across an area. Based on the use of the Iterative Closest Point (ICP) algorithm for pose estimation, CSCA is applied to a surface-based self-registration cost function which takes into account the direction from which the surface is scanned. The continuum nature of the CSCA formulation makes the registration cost matrix and any derived metrics pure shape properties of the object. For the directional scanning considered in this paper, these properties also become functions of viewing direction and are directly applicable to the best-view problem for LIDAR/ICP pose estimation. This paper introduces the Expectivity Index and uses it to illustrate the ability of the CSCA approach to identify productive views via the expected stability of the global minimum solution. As also demonstrated through the examples, CSCA can be used to produce visual maps of geometric constraint that facilitate human interpretation of the shape information. Like the ICP algorithm it supports, the CSCA approach processes shape information without the need for specific feature identification and is applicable to any type of object.
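For context, the sketch below (NumPy/SciPy) shows one iteration of the point-to-point ICP loop that such a registration cost function is built around: nearest-neighbour matching followed by a Kabsch re-fit of the rigid transform. It is a generic ICP step under assumed array shapes, not the CSCA formulation or the paper's implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_step(model_pts, scan_pts, R, t):
    """One ICP update: match scan points to the model surface samples, then re-fit (R, t)."""
    tree = cKDTree(model_pts)
    transformed = scan_pts @ R.T + t
    _, idx = tree.query(transformed)              # nearest model point for each scan point
    matched = model_pts[idx]
    # Kabsch/Procrustes fit of the rigid transform to the current correspondences.
    mu_s, mu_m = scan_pts.mean(axis=0), matched.mean(axis=0)
    H = (scan_pts - mu_s).T @ (matched - mu_m)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R_new = Vt.T @ D @ U.T                        # rotation mapping scan points onto the model
    t_new = mu_m - R_new @ mu_s
    return R_new, t_new
```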
Camera pose estimation from video images is a fundamental problem in machine vision and Augmented Reality (AR) systems. Most existing solutions are either linear, for both n points and n lines, or iterative, relying on nonlinear optimization of some geometric constraints. In this paper, we first survey several existing methods and compare their performance in an AR context. Then, we present a new linear algorithm, based on a square-fiducial localization technique, which gives a closed-form solution to the pose estimation problem, free of any initialization. We also propose a hybrid technique which combines an iterative method, namely the orthogonal iteration (OI) algorithm, with our closed-form solution. An evaluation of the methods shows that this hybrid pose estimation technique is accurate and robust. Numerical experiments on real data compare the performance of our hybrid method with several iterative techniques and demonstrate the efficiency of our approach.
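To illustrate the setting (a calibrated camera and the four ordered corners of a square fiducial of known side length), the sketch below uses OpenCV's closed-form IPPE-square solver. This is an off-the-shelf alternative shown only for orientation; it is not the linear algorithm proposed in the paper, and the function name and corner ordering convention are assumptions taken from OpenCV's documentation.

```python
import numpy as np
import cv2

def square_marker_pose(corners_px, side, K, dist=None):
    """corners_px: 4x2 image corners ordered TL, TR, BR, BL; returns (R, t) of the marker."""
    s = side / 2.0
    # Object points in the order required by cv2.SOLVEPNP_IPPE_SQUARE.
    obj = np.array([[-s,  s, 0.0],
                    [ s,  s, 0.0],
                    [ s, -s, 0.0],
                    [-s, -s, 0.0]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(obj, np.asarray(corners_px, dtype=np.float64),
                                  K, dist, flags=cv2.SOLVEPNP_IPPE_SQUARE)
    R, _ = cv2.Rodrigues(rvec)                    # rotation vector -> rotation matrix
    return R, tvec
```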
To address the problem of low tracking accuracy when many objects are recognized by existing methods, we propose a real-time multi-person pose tracking method using deep reinforcement learning. First, a convolutional neural network (CNN) is used to predict the human key points and center vectors in grid mode; the key points are pointed toward the human center according to the center vectors and grouped by their distance to the center, completing the multi-person pose estimation and producing a human pose sequence diagram. Then, the human pose sequence diagram is input into the deep reinforcement learning network, and the pose label and category label are output in the supervised learning and training stage. The best pose tracking strategy obtained in the reinforcement learning and training stage is applied to online tracking. Finally, the CNN is used to predict the rectangular frame position of the pose instead of the target pose, and tracking is completed when the pose stops; at this point, the rectangular frame position is the result of multi-person pose tracking. The results show that the maximum expected average overlap (EAO) of the proposed method is 0.53, and when the root mean square error (RMSE) threshold of the position component reaches 8, the accuracy remains stable at 0.98. Therefore, the proposed method has high tracking accuracy. In the future, it can be applied to smart home scenarios to realize human pose tracking, effectively identify dangerous human poses, and ensure residents' safety.
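The keypoint-to-person grouping step can be sketched in a few lines of NumPy: each keypoint is shifted by its predicted center vector and assigned to the nearest detected person center. Array names, shapes, and the nearest-center rule are assumptions made for illustration, not the paper's exact grouping procedure.

```python
import numpy as np

def group_keypoints(keypoints, center_vecs, centers):
    """
    keypoints:   (N, 2) predicted keypoint locations
    center_vecs: (N, 2) offsets pointing from each keypoint toward its person center
    centers:     (M, 2) detected person centers
    Returns a length-N array assigning each keypoint to a center index.
    """
    pointed = keypoints + center_vecs                               # where each keypoint "points"
    d = np.linalg.norm(pointed[:, None, :] - centers[None, :, :], axis=-1)
    return np.argmin(d, axis=1)                                     # nearest center wins
```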
Object detection refers to investigating the relationships between images or videos and detected objects to improve system utilization and support better decisions. Productivity measurement plays a key role in assessing operational efficiency across different industries. However, capturing workers' working status can be resource-intensive and constrained by a limited sample size if sampling is conducted manually. While object detection approaches provide a shortcut for collecting image samples, classifying human poses involves training pose estimation models that often require substantial image annotation effort. In this study, a systematic approach that integrates pose estimation techniques, fuzzy-set theory, and machine learning algorithms is proposed at an affordable level of computational resources. The Random Forests algorithm is used for the classification tasks, while fuzzy approximation captures the imprecision associated with human poses, enhancing robustness to variability and accounting for inherent uncertainty. Decision-makers can utilize the proposed approach without high computational resources or extensive data collection efforts, making it suitable for deployment in various workplace environments.
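A rough sketch of this kind of pipeline, in Python with scikit-learn: pose-derived joint angles are fuzzified into membership degrees and a Random Forest is trained on them. The feature choice, membership breakpoints, and labels are illustrative assumptions, not the study's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fuzzify_angle(theta_deg):
    """Triangular membership degrees for 'bent', 'neutral', and 'upright' postures (assumed breakpoints)."""
    bent    = np.clip((90.0 - theta_deg) / 60.0, 0.0, 1.0)
    neutral = np.clip(1.0 - np.abs(theta_deg - 105.0) / 45.0, 0.0, 1.0)
    upright = np.clip((theta_deg - 120.0) / 60.0, 0.0, 1.0)
    return np.stack([bent, neutral, upright], axis=-1)

def train_status_classifier(joint_angles, labels):
    """joint_angles: (n_samples, n_joints) angles from a pose estimator; labels: working-status classes."""
    X_fuzzy = fuzzify_angle(joint_angles).reshape(len(joint_angles), -1)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return clf.fit(X_fuzzy, labels)
```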
There is currently a division between real-world human performance and the decision making of socially interactive robots. This circumstance is partially due to the difficulty of estimating human cues, such as pose and gesture, from robot sensing. Toward bridging this division, we present a method for kinematic pose estimation and action recognition from monocular robot vision through the use of dynamical human motion vocabularies. Our notion of a motion vocabulary is composed of movement primitives that structure a human's action space for decision making and predict human movement dynamics. Through prediction, such primitives can be used both to generate motor commands for specific actions and to perceive humans performing those actions. In this paper, we focus specifically on the perception of human pose and performed actions using a known vocabulary of primitives. Given image observations over time, each primitive infers pose independently using its expected dynamics in the context of a particle filter. Pose estimates from a set of primitives inferencing in parallel are arbitrated to estimate the action being performed. The efficacy of our approach is demonstrated through interactive-time pose and action recognition over extended motion trials. Results show that our approach requires a small number of particles for tracking, is robust to unsegmented multi-action movement, movement speed, and camera viewpoint, and is able to recover from occlusions.
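The per-primitive inference can be pictured as a standard particle-filter step, sketched below in NumPy. Here predict_fn and likelihood_fn stand in for a primitive's learned dynamics and the image observation model; those names, and the resampling threshold, are assumptions made for the sketch rather than the authors' implementation.

```python
import numpy as np

def particle_filter_step(particles, weights, observation, predict_fn, likelihood_fn, rng):
    """particles: (N, D) pose hypotheses; weights: (N,) normalized weights; rng: np.random.Generator."""
    # Propagate each pose particle through the primitive's expected dynamics.
    particles = np.stack([predict_fn(p, rng) for p in particles])
    # Re-weight by how well each predicted pose explains the image observation.
    weights = weights * np.array([likelihood_fn(p, observation) for p in particles])
    weights = weights / (weights.sum() + 1e-12)
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(particles):
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights
```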
Sign Language is the natural language used by the hearing-impaired community. Because it is used by a comparatively small part of society, it is necessary to convert this language into a commonly understandable form. Automatic Sign Language interpreters can convert signs into text or audio by interpreting the hand movements and the corresponding facial expressions. These two modalities work in tandem to give complete meaning to each word. In verbal communication, emotions can be conveyed by changing the tone and pitch of the voice, but in sign language, emotions are expressed using nonmanual movements that include body posture and facial muscle movements. Each such subtle movement should be considered a feature and extracted using different models. This paper proposes three different models that can be used for varying levels of sign language. The first test was carried out using a Convex Hull-based Sign Language Recognition (SLR) model for fingerspelling sign language, the next using a Convolutional Neural Network-based Sign Language Recognition (CNN-SLR) model for fingerspelling sign language, and the last using a pose-based SLR model for word-level sign language. The experiments show that the pose-based SLR model, which captures features using landmarks or key points, achieves better SLR accuracy than the Convex Hull and CNN-based SLR models.