To accurately and quickly understand the characteristics and skills of athletes' traditional sports actions, an unmarked AI recognition algorithm for athletes' traditional sports actions based on an attention mechanism is designed. A lightweight spatio-temporal graph convolutional network (ST-GCN) with attention is used to recognize the athletes' traditional sports actions. Skeleton graphs of the athletes' actions are constructed as unlabeled samples and fed into the ST-GCN, where the temporal and spatial characteristics of the input skeleton graphs are extracted by a temporal convolutional network (TCN) and a graph convolutional network (GCN), respectively. A graph attention mechanism and a channel attention mechanism are added at the layer and channel levels to improve feature expressiveness and recognition accuracy, and a Ghost module replaces the original graph convolution operations to lighten the network and improve the efficiency of ST-GCN action recognition. Classification and recognition of the athletes' traditional sports actions are completed with a standard SoftMax layer. Experiments show that when the attention window size added to the ST-GCN is 1000, the network achieves its best performance with relatively minimal resource cost. After adding the attention mechanism, the F1 score of the ST-GCN under unmarked training is close to 1, the extracted athlete skeletons are highly accurate, and the algorithm can accurately identify the different actions produced by different kinds of movements.
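The abstract does not give implementation details for the channel attention; the following is a minimal sketch of a squeeze-and-excitation-style channel attention block applied to ST-GCN feature maps of shape (batch, channels, time, joints), assuming PyTorch; the reduction ratio and tensor sizes are hypothetical.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention for ST-GCN feature maps of shape (N, C, T, V)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, v = x.shape
        s = x.mean(dim=(2, 3))           # squeeze: average over time and joints
        w = self.fc(s).view(n, c, 1, 1)  # excitation: per-channel gates in [0, 1]
        return x * w                     # reweight channels

# Example: gate a feature map produced by a GCN/TCN block.
feat = torch.randn(2, 64, 50, 25)        # batch=2, 64 channels, 50 frames, 25 joints
print(ChannelAttention(64)(feat).shape)  # torch.Size([2, 64, 50, 25])
```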
To address the low recognition rate, low recognition efficiency, and poor recognition effect of current dance motion recognition methods affected by the surrounding environment, this study proposes a dance action recognition model based on the spatial frequency domain features of contour images. Texture information is used to extract the dance action contour image; the feature vector of the contour image is obtained by the hypercomplex Fourier transform; and phase spectrum and energy spectrum transformations are adopted to smooth the contour image and generate a saliency map, completing the extraction and preprocessing of the dance action contour image. The high-frequency and low-frequency parts of the dance action are separated by the discrete cosine transform, the number of pixels contained in the dance action images to be recognized is computed, the spatial frequency domain features of the dance action contour image are extracted, and a human posture model is built. The model then realizes dance action recognition by using a classifier to process the extracted feature vectors and their labels. Experimental results show that the dance action recognition effect of the proposed model is good, with a high recognition rate across different dance action types, and it can effectively meet the needs of dance action recognition.
Structural seismic stability is an important research topic in the field of structural engineering safety. Current methods for assessing the seismic stability of single-layer reticulated dome structures, which support engineering decisions on strengthening, repair, or demolition, are complex and do not lend themselves to engineering applications; deep learning action recognition networks are therefore employed to analyze structural deformation and assess seismic stability. To tackle the problem of an insufficient receptive field for global structural change during the dynamic response process within such networks, a Dual-Branch Attention Module (DBAM) is proposed, which enables effective perception of the global deformation of reticulated dome structures. The DBAM consists of the Maxpooling Channel Attentional (MCA) branch and the Large Kernel Pyramid Attentional (LKPA) branch, which provide the network with multi-scale global perceptual information and thus enhance the recognition ability of the model. In addition, the ReticDomeSeismic dataset is created using a proposed mapping from displacement intervals to RGB colors; it contains a large amount of video data on the seismic analysis of single-layer reticulated dome structures under different parameters. The dataset was employed to verify the proposed DBAM method, and the experimental results show that the DBAM improves the Mean Accuracy of baseline action recognition methods by 4.37% on average, with a highest Top-1 Accuracy of 93.48%. The proposed structural deformation recognition method can therefore quickly and accurately assess the seismic stability of single-layer reticulated dome structures, and it also provides useful insights and guidance for engineering practice.
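The paper's exact displacement-to-color mapping is not specified in this abstract; the sketch below illustrates one plausible way to bin nodal displacement magnitudes into intervals and map each interval to an RGB color, assuming NumPy; the interval count, color table, and displacement scale are hypothetical.

```python
import numpy as np

# Hypothetical 8-interval color table (blue = small displacement, red = large).
COLOR_TABLE = np.array([
    [0.0, 0.0, 1.0], [0.0, 0.5, 1.0], [0.0, 1.0, 1.0], [0.0, 1.0, 0.0],
    [0.5, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.5, 0.0], [1.0, 0.0, 0.0],
])

def displacement_to_rgb(displacements, max_disp):
    """Bin nodal displacement magnitudes into intervals and map them to RGB colors."""
    disp = np.clip(np.abs(np.asarray(displacements, dtype=float)), 0.0, max_disp)
    bins = np.minimum((disp / max_disp * len(COLOR_TABLE)).astype(int),
                      len(COLOR_TABLE) - 1)
    return COLOR_TABLE[bins]

# Example: color 500 dome nodes by their displacement at one time step.
rgb = displacement_to_rgb(np.abs(np.random.randn(500)) * 0.05, max_disp=0.2)
print(rgb.shape)  # (500, 3)
```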
Sports technology and 3D motion recognition require models that can accurately identify athletes' movements, which is crucial for training analysis, game strategy development, and refereeing assistance. To maintain a high recognition rate across different competition scenes, athlete styles, and environmental conditions, and to ensure the practicality and reliability of the model, two independent 3D convolutional neural networks are used to construct a two-stream 3D residual network action recognition model. Based on the spatio-temporal characteristics of human movements in video, the model introduces an attention mechanism combined with the time dimension, yielding a two-stream 3D residual network action recognition model that integrates time-channel attention. The two-stream 3D residual network with a pre-activation structure achieves average Top-1 and Top-5 accuracies of 68.97% and 91.68%, respectively, showing that pre-activated residual blocks enhance the model's effectiveness. The two-stream 3D spatio-temporal residual network integrating time-channel attention achieves average Top-1 and Top-5 accuracies of 85.73% and 92.05%, respectively, which is higher. This model is accurate and achieves good recognition results on volleyball action recognition in real scenes.
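For illustration only, here is a minimal sketch of a pre-activation 3D residual block of the kind the abstract mentions, written in PyTorch; the layer widths and tensor sizes are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class PreActResBlock3D(nn.Module):
    """Pre-activation 3D residual block: BN -> ReLU -> Conv3d, applied twice."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.body(x)  # identity shortcut added after the pre-activated convs

# Example: a clip tensor of shape (batch, channels, frames, height, width).
clip_feat = torch.randn(1, 32, 16, 28, 28)
print(PreActResBlock3D(32)(clip_feat).shape)  # torch.Size([1, 32, 16, 28, 28])
```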
Human action recognition using depth sensors is an emerging technology, especially in the game console industry. Depth information can provide robust features about 3D environments and increase the accuracy of action recognition at short range. This paper presents an approach to recognizing basic human actions using depth information obtained from the Kinect sensor. To recognize actions, features extracted from the angle and displacement information of joints are used. Actions are classified using support vector machines and the random forest (RF) algorithm. The model is tested on the HUN-3D, MSRC-12, and MSR Action 3D datasets with various testing approaches and obtains promising results, especially with the RF algorithm. The proposed approach produces robust results independent of the dataset, with simple and computationally cheap features.
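As a rough sketch of this kind of pipeline (not the authors' exact feature set), the snippet below computes per-joint displacement statistics and an elbow angle per frame, pools them over a clip, and classifies with scikit-learn's random forest; the joint indices and feature choices are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def joint_angle(a, b, c):
    """Angle at joint b formed by segments b->a and b->c (radians)."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def clip_features(skeleton):
    """skeleton: (frames, joints, 3) joint positions -> fixed-length feature vector."""
    disp = np.linalg.norm(np.diff(skeleton, axis=0), axis=2)   # per-joint displacements
    # Hypothetical joint indices: 4=shoulder, 5=elbow, 6=wrist (left arm).
    angles = np.array([joint_angle(f[4], f[5], f[6]) for f in skeleton])
    return np.concatenate([disp.mean(axis=0), disp.std(axis=0),
                           [angles.mean(), angles.std()]])

# Example with synthetic data: 100 clips, 30 frames, 20 joints each.
X = np.stack([clip_features(np.random.rand(30, 20, 3)) for _ in range(100)])
y = np.random.randint(0, 5, size=100)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(clf.predict(X[:3]))
```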
A new action model is proposed by revisiting local binary patterns (LBP) for dynamic texture modeling, applied to trajectory beams computed on the video. The use of a semi-dense trajectory field dramatically reduces the computational support to essential motion information, while maintaining a large amount of data to ensure the robustness of statistical bag-of-features action models. A new binary pattern, called the Spatial Motion Pattern (SMP), is proposed, which captures the self-similarity of velocity around each tracked point (particle) along its trajectory. This operator highlights the geometric shape of rigid parts of moving objects in a video sequence. SMPs are combined with basic velocity information to form local action primitives. A global representation of a space × time video block is then obtained using hierarchical blockwise histograms, which efficiently represent the action as a whole while preserving a certain level of spatio-temporal relation between the action primitives. Inheriting the efficiency and invariance properties of both the semi-dense tracker Video extruder and LBP-based representations, the method is designed for fast computation of action descriptors in unconstrained videos. To improve both robustness and computation time for high-definition video, we also present an enhanced version of the semi-dense tracker based on so-called super particles, which reduces the number of trajectories while improving their length, reliability, and spatial distribution.
This paper addresses the problem of action recognition from body pose. Detecting body pose in static images is very challenging because of pose variability. Our method is based on action-specific hierarchical poselets. We use hierarchical body parts, each of which is represented by a set of poselets that capture the pose variability of that part. The pose signature of a body part is a vector of the detection responses of all poselets for that part. To suppress detection error and ambiguity, we explore using a part-based model (PBM) as detection context. We propose a constrained optimization algorithm for detecting all poselets of each part in the context of the PBM, which recovers neglected pose cues through global optimization. We use a PBM with a hierarchical part structure, where body parts range in granularity from the whole body down to limb parts. From this structure we derive models of different depths to study the saliency of different body parts in action recognition. The pose signature of an action image is composed of the pose signatures of all body parts in the PBM, which provides rich discriminative information for our task. We evaluate our algorithm on two datasets. Compared with counterpart methods, the pose signature yields a clear performance improvement on the static image dataset, and when the model trained on the static image dataset is used to label detected action persons on the video dataset, it achieves state-of-the-art performance.
A human action recognition system based on image depth is proposed in this paper. Depth information features are not easily disturbed by noise, and because of this the system can quickly extract foreground targets. The target data, namely depth and two-dimensional (2D) data, are projected onto three orthogonal planes, so that an action whose motion is mainly along the optical axis (in depth) can still have its trajectory clearly described. Based on changes in motion energy and the angle variations of motion orientations, the temporal segmentation (TS) method automatically segments a complex action into several simple movements. Three-dimensional (3D) data are further used to acquire the three-viewpoint (3V) motion history trajectory, whereby a target's motion is described through motion history images (MHIs) from the three viewpoints. Weightings corresponding to the gradients of the MHIs are included to determine the viewpoint that best describes the target's motion. For feature extraction, multi-resolution motion history histograms effectively reduce the computational load and achieve a high recognition rate. Experimental results demonstrate that the proposed method can effectively solve the self-occlusion problem.
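As a simplified illustration of a motion history image (not the paper's exact three-viewpoint formulation), the sketch below updates an MHI from a grayscale sequence using the classical decay rule, assuming NumPy; the duration and threshold values are arbitrary.

```python
import numpy as np

def update_mhi(mhi, frame_prev, frame_curr, tau=30, thresh=30):
    """Classical MHI update: set moving pixels to tau, decay all other pixels by 1."""
    motion = np.abs(frame_curr.astype(int) - frame_prev.astype(int)) > thresh
    return np.where(motion, tau, np.maximum(mhi - 1, 0))

# Example: accumulate an MHI over a synthetic 8-bit grayscale sequence.
frames = (np.random.rand(20, 120, 160) * 255).astype(np.uint8)
mhi = np.zeros(frames.shape[1:], dtype=int)
for prev, curr in zip(frames[:-1], frames[1:]):
    mhi = update_mhi(mhi, prev, curr)
print(mhi.max(), mhi.shape)
```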
In this paper, we present a new approach to human action recognition using 3D skeleton joints recovered from RGB-D cameras. We propose a descriptor based on differences of skeleton joints that combines two characteristics, static posture and overall dynamics, encoding spatial and temporal aspects. We then apply the mean function to these characteristics to form the feature vector, which is used as the input to a Random Forest classifier for action classification. Experimental results on the MSR Action 3D and MSR Daily Activity 3D datasets demonstrate that our approach is efficient and gives promising results compared to state-of-the-art approaches.
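The descriptor is not fully specified in the abstract; the snippet below sketches one common interpretation, assuming static posture means within-frame pairwise joint differences and overall dynamics means differences with respect to the first frame, both averaged over time (the mean function) and fed to scikit-learn's Random Forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def joint_difference_descriptor(skel):
    """skel: (frames, joints, 3) -> fixed-length feature vector.

    Static posture  : pairwise joint differences within each frame.
    Overall dynamics: joint differences with respect to the first frame.
    Both are averaged over time to form the descriptor.
    """
    T, J, _ = skel.shape
    idx_a, idx_b = np.triu_indices(J, k=1)
    static = (skel[:, idx_a] - skel[:, idx_b]).reshape(T, -1)   # (T, pairs*3)
    dynamics = (skel - skel[0:1]).reshape(T, -1)                # (T, J*3)
    return np.concatenate([static.mean(axis=0), dynamics.mean(axis=0)])

# Example with synthetic skeleton clips: 60 clips, 40 frames, 20 joints each.
X = np.stack([joint_difference_descriptor(np.random.rand(40, 20, 3))
              for _ in range(60)])
y = np.random.randint(0, 4, size=60)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.score(X, y))
```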
Human action recognition based on depth video sequences is an important research direction in computer vision. The present study proposes a hierarchical multi-view classification framework for depth video sequence-based action recognition. Considering the distinguishing features of 3D human action space, we project the 3D human action image onto three coordinate planes, converting the 3D depth image into three 2D images, which are fed into three subnets, respectively. As the number of layers increases, the representations of the subnets are hierarchically fused to form the inputs of the next layers. The final representations of the depth video sequence are fed into a single-layer perceptron, and the final result is decided by accumulating the perceptron's outputs over time. We compare our method with others on two publicly available datasets and also verify it on a human action database acquired with our Kinect system. Experimental results demonstrate that our model has high computational efficiency and achieves state-of-the-art performance.
To address the problem that many existing approaches are not suitable for action recognition in low-resolution (LR) videos, this paper presents a framework based on Dempster–Shafer (DS) theory. In the framework, artificial neural networks (ANNs) are first trained for every class with training samples, and basic belief assignments (BBAs) for the underlying classes are then computed with the trained ANNs. The resulting BBAs from all frames in the video are fused sequentially, frame by frame, using DS's rule of combination. Action recognition is finally performed with threshold-based decision making. We conducted experiments on extensive test data with various levels of video resolution. The results reveal that the proposed framework (1) outperforms state-of-the-art classification approaches such as sequence matching, voting-based strategies, and the bag-of-words (BoW) method, and (2) can achieve low observational latency in recognition.
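For concreteness, here is a minimal sketch of Dempster's rule of combination for BBAs whose focal elements are the singleton classes plus the full frame of discernment, fused frame by frame; the per-frame BBAs below are synthetic placeholders rather than real ANN outputs, and the 0.2 mass kept on the frame of discernment is an arbitrary choice.

```python
import numpy as np

def combine(m1, m2):
    """Dempster's rule for masses over singletons {c1..cn} plus Theta (last entry).

    m1, m2: arrays of length n+1; m[:-1] are singleton masses, m[-1] is m(Theta).
    """
    s1, t1 = m1[:-1], m1[-1]
    s2, t2 = m2[:-1], m2[-1]
    conflict = s1.sum() * s2.sum() - np.dot(s1, s2)   # mass on conflicting singleton pairs
    singles = s1 * s2 + s1 * t2 + t1 * s2
    theta = t1 * t2
    return np.append(singles, theta) / (1.0 - conflict)

# Example: fuse synthetic per-frame BBAs for 3 classes over 10 frames.
rng = np.random.default_rng(0)
fused = np.array([0.0, 0.0, 0.0, 1.0])               # start from total ignorance
for _ in range(10):
    raw = rng.random(3)
    bba = np.append(0.8 * raw / raw.sum(), 0.2)       # keep 0.2 mass on Theta
    fused = combine(fused, bba)
print(fused)  # class masses concentrate as evidence accumulates
```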
Skeleton-based action recognition distinguishes human actions using the trajectories of skeleton joints, which are a good representation of human behaviors. Conventional methods usually construct classifiers with hand-crafted or learned features to recognize human actions. Rather than constructing a direct action classifier, this paper attempts to identify human actions based on the development trends of behavior sequences. Specifically, we first utilize a memory neural network to construct an action predictor for each kind of activity. These predictors output the action trend at the next time step. According to the predictions at each time step and a removal rule, the poorer predictors are eliminated step by step, and the identity (ID) number of the last predictor left is taken as the label of the action sequence to be categorized. We compare the proposed sequence-prediction-based action recognition algorithm with other methods on two publicly available datasets. The experimental results consistently demonstrate the feasibility and effectiveness of the proposed method, and they also show the importance of prediction learning for action recognition.
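The removal rule is not fully specified in the abstract; the sketch below assumes one predictor per class, scores each by its one-step prediction error at every time step, and eliminates the worst remaining predictor until one survives. The predictors here are stand-in linear extrapolators, not the paper's memory neural networks.

```python
import numpy as np

def classify_by_elimination(sequence, predictors):
    """Eliminate the worst predictor at each step; return the surviving class ID."""
    alive = set(predictors)
    for t in range(2, len(sequence)):
        if len(alive) == 1:
            break
        errors = {cid: np.linalg.norm(predictors[cid](sequence[:t]) - sequence[t])
                  for cid in alive}
        alive.remove(max(errors, key=errors.get))  # drop the poorest predictor
    return next(iter(alive))

# Stand-in per-class predictors: next pose = last pose + scaled last velocity
# (the scale factors are arbitrary stand-ins for per-class behavior models).
predictors = {cid: (lambda h, s=s: h[-1] + s * (h[-1] - h[-2]))
              for cid, s in {0: 0.5, 1: 1.0, 2: 2.0}.items()}

seq = np.cumsum(np.ones((20, 3)), axis=0)        # constant-velocity test sequence
print(classify_by_elimination(seq, predictors))  # expected survivor: class 1
```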
Drivers' drowsiness is a major cause of traffic accidents, and driver drowsiness detection has recently drawn considerable attention. In this paper, a novel drowsiness detection scheme is proposed that recognizes drivers' drowsiness actions from their facial expressions. First, a drowsiness action recognition model based on a 3D-CNN is proposed, which effectively distinguishes drivers' drowsiness actions from non-drowsiness actions. Second, a fusion algorithm for the two input streams is proposed, which fuses the gray image sequence and the optical image sequence containing target motion information. Finally, the proposed model is evaluated on the National Tsing Hua University Driver Drowsiness Detection (NTHU-DDD) dataset. The experimental results show that the algorithm performs better than other algorithms, and its accuracy reaches 86.64%.
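The exact fusion algorithm is not described in the abstract; as a rough sketch of a two-stream 3D-CNN input, the snippet below stacks a grayscale clip and a frame-difference motion clip as two channels of one 5D tensor, assuming PyTorch; using frame differences for the motion stream and the tiny head architecture are assumptions.

```python
import torch
import torch.nn as nn

def build_two_stream_clip(gray_clip: torch.Tensor) -> torch.Tensor:
    """gray_clip: (T, H, W) -> (2, T, H, W) with a frame-difference motion stream."""
    motion = torch.zeros_like(gray_clip)
    motion[1:] = (gray_clip[1:] - gray_clip[:-1]).abs()  # crude motion cue
    return torch.stack([gray_clip, motion], dim=0)

# A tiny 3D-CNN head over the fused two-channel clip (architecture is illustrative only).
model = nn.Sequential(
    nn.Conv3d(2, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, 2),  # drowsy / not drowsy
)

clip = build_two_stream_clip(torch.rand(16, 64, 64)).unsqueeze(0)  # (1, 2, 16, 64, 64)
print(model(clip).shape)  # torch.Size([1, 2])
```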
Body pose analysis is an important factor in human action recognition. Recurrent Neural Networks (RNNs) and deep ConvNet-based methods have recently shown good performance in learning sequential information. Despite this, RNNs struggle to efficiently learn spatial relations between body parts, while deep ConvNets require a huge amount of training data. We propose a Distance-based Neural Network (DNN) for action recognition in static images. We compute effective distances between a set of body part pairs for a given image and feed them to the DNN to learn effective representations of complex actions. We also propose a Distance-based Convolutional Neural Network (DCNN) to learn representations from 2D images: the distances are rearranged into a 2D grayscale image called a Distance Image. This 2D representation allows the network to learn discriminative information between adjacent pixel distance values corresponding to different body part pairs. We evaluate our method on two real-world datasets, UT-Interaction and SBU Kinect Interaction. Results show that the proposed method achieves better performance than state-of-the-art approaches.
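The exact layout of the Distance Image is not given in the abstract; the sketch below computes pairwise Euclidean distances between 2D joints and arranges them into a square grayscale matrix, which is one plausible arrangement, assuming NumPy.

```python
import numpy as np

def distance_image(joints):
    """joints: (J, 2) 2D keypoints -> (J, J) grayscale matrix of pairwise distances.

    Pixel (i, j) holds the Euclidean distance between joints i and j,
    normalized to [0, 255] so the matrix can be treated as an 8-bit image.
    """
    diff = joints[:, None, :] - joints[None, :, :]   # (J, J, 2)
    dist = np.linalg.norm(diff, axis=-1)             # (J, J)
    dist = 255.0 * dist / (dist.max() + 1e-8)
    return dist.astype(np.uint8)

# Example: 15 detected joints from a pose estimator (synthetic coordinates here).
img = distance_image(np.random.rand(15, 2) * 200)
print(img.shape, img.dtype)  # (15, 15) uint8
```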
In this paper, we propose an end-to-end multi-resolution three-dimensional (3D) capsule network for detecting the actions of multiple actors in a video scene. Unlike previous capsule-network-based action recognition, which is not specifically concerned with the individual actions of multiple actors in a single scene, our 3D capsule network takes advantage of a multi-resolution technique to detect the different actions of multiple actors of different sizes, scales, and aspect ratios. Our 3D capsule network is built on top of a 3D convolutional neural network (3DCNN) that extracts spatio-temporal features from video frames inside regions of interest generated by Faster RCNN object detection. We first apply our method to the problem of detecting illegal cheating activities in a classroom examination scene with multiple subjects involved. Second, we test our system on the publicly available and extensively studied UCF-101 dataset. We compare our method with several state-of-the-art 3DCNN-based methods: the multi-resolution 3DCNN, the single-resolution 3D capsule network, and a combination of both. We show that models containing 3D capsule networks have a slight advantage over the conventional 3DCNN and the multi-resolution 3DCNN. Our 3D capsule networks not only classify the actions but also generate videos of single actions. Our experimental results show that the use of multi-resolution pathways in the 3D capsule networks makes the results even better. These findings also hold when we use pre-trained C3D (convolutional 3D) features to train the networks. We believe that the multiple resolutions capture lower-level features at different scales, while the 3D capsule layers combine these features in more complex ways than conventional convolutional models.
The modeling of channel and temporal information is of crucial importance for action recognition tasks. To build a high-performance action recognition network that effectively captures channel and temporal information, we propose CLS-Net, an action recognition algorithm based on channel-temporal information modeling. CLS-Net characterizes channel and temporal information by inserting multiple modules into an end-to-end backbone network: a channel attention module (CA module) for modeling channel information, and a long-term temporal module (LT module) and a short-term temporal module (ST module) for modeling temporal information. Specifically, the CA module extracts the correlation between feature channels so that the network learns, from global information, to selectively strengthen features containing useful information and suppress useless ones. The LT module shifts some channels along the temporal dimension to realize information interaction across time and model global temporal information. The ST module enhances motion-sensitive features by computing feature-level frame-difference information, representing local motion. Since the multi-module insertion mode directly affects the final performance of the whole model, we propose a novel insertion mode, instead of a simple series or parallel connection, so that the modules complement and cooperate with one another more efficiently. CLS-Net achieves state-of-the-art performance on the EgoGesture and Jester datasets among networks of the same type and achieves competitive results on the Something-Something V2 dataset.
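As a minimal sketch of the channel-shifting idea behind the LT module (in the spirit of a temporal shift, not necessarily the authors' exact implementation), the snippet below shifts one eighth of the channels forward and one eighth backward along the time axis of a (batch, time, channels, H, W) tensor, assuming PyTorch.

```python
import torch

def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """Shift a fraction of channels along the time axis.

    x: (N, T, C, H, W). The first C/fold_div channels are shifted forward in
    time, the next C/fold_div backward, and the remaining channels are untouched.
    """
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                  # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # no shift
    return out

# Example: 8-frame feature maps with 64 channels.
feat = torch.randn(2, 8, 64, 14, 14)
print(temporal_shift(feat).shape)  # torch.Size([2, 8, 64, 14, 14])
```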
Action recognition is an active research area in computer vision. In the last few years, there has been growing interest in deep learning networks such as Long Short-Term Memory (LSTM) architectures, owing to their efficiency in processing long time sequences. In this context, there is considerable interest in developing accurate action recognition approaches with low complexity. This paper introduces a method for learning from depth activity videos based on LSTM and classification fusion. The first step consists of extracting compact depth video features: Depth Motion Maps (DMM) are computed from each sequence, and the contour and texture characteristics of the DMMs are encoded and concatenated using histogram-of-oriented-gradients and local-binary-pattern descriptors. The second step is depth video classification based on naive Bayes fusion: three classifiers, a collaborative representation classifier, a kernel-based extreme learning machine, and an LSTM, are trained separately to obtain classification scores, and the score outputs of all classifiers are fused with the naive Bayesian method to obtain the final predicted label. Our proposed method achieves a significant improvement in recognition rate compared to previous work using the Kinect v2 and UTD-MHAD human action datasets.
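The fusion step can be illustrated with a small sketch: assuming each classifier outputs per-class probability scores and the classifiers are treated as conditionally independent (the naive Bayes assumption), the fused posterior is proportional to the prior times the product of the per-classifier probabilities; the scores below are synthetic placeholders, not outputs of the paper's classifiers.

```python
import numpy as np

def naive_bayes_fusion(score_list, prior=None):
    """Fuse per-class probability scores from several classifiers.

    score_list: list of arrays, each of shape (num_classes,) summing to 1.
    Under the independence assumption, P(class | all scores) is proportional
    to prior * product of the per-classifier class probabilities.
    """
    num_classes = len(score_list[0])
    prior = np.full(num_classes, 1.0 / num_classes) if prior is None else prior
    log_post = np.log(prior)
    for scores in score_list:
        log_post += np.log(np.clip(scores, 1e-12, 1.0))
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Example: three classifiers (e.g. CRC, KELM, LSTM) scoring 5 action classes.
crc  = np.array([0.10, 0.60, 0.10, 0.10, 0.10])
kelm = np.array([0.20, 0.50, 0.10, 0.10, 0.10])
lstm = np.array([0.05, 0.70, 0.10, 0.05, 0.10])
fused = naive_bayes_fusion([crc, kelm, lstm])
print(fused.argmax(), fused.round(3))  # class 1 dominates after fusion
```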
Self-supervised learning is a promising paradigm for addressing the cost of manual annotation by effectively leveraging unlabeled videos. By solving self-supervised pretext tasks, powerful video representations can be discovered automatically. However, recent pretext tasks for videos rely on the temporal properties of videos and ignore the crucial supervisory signals in the spatial subspace. We therefore present a new self-supervised pretext task called Multi-Label Transformation Prediction (MLTP) to fully exploit the spatio-temporal information in videos. In MLTP, all videos are jointly transformed by a set of geometric and color-space transformations, such as rotation, cropping, and color-channel split. We formulate the pretext task as multi-label prediction: a 3D-CNN is trained to predict the composition of underlying transformations as multiple outputs. Transformation-invariant video features can thereby be learned in a self-supervised manner. Experimental results verify that 3D-CNNs pre-trained with MLTP yield video representations with improved generalization performance for action recognition downstream tasks on the UCF101 (+2.4%) and HMDB51 (+7.8%) datasets.
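As a rough sketch of the pretext setup (not the authors' exact transformation set or parameters), the snippet below randomly applies a rotation, a center-crop-and-resize, and a channel permutation standing in for the color-channel split to a video clip tensor, and records which transformations were applied as a multi-hot label that a 3D-CNN would be trained to predict, assuming PyTorch.

```python
import torch

def transform_clip(clip: torch.Tensor, generator=None):
    """clip: (C, T, H, W) in [0, 1]. Returns (transformed clip, multi-hot label).

    Label bits: [rotate 90 degrees, center-crop-and-resize, channel permutation].
    """
    g = generator or torch.Generator().manual_seed(0)
    label = (torch.rand(3, generator=g) > 0.5).float()
    out = clip
    if label[0]:  # rotate each frame by 90 degrees in the spatial plane
        out = torch.rot90(out, k=1, dims=(2, 3))
    if label[1]:  # center-crop to half size and resize back
        c, t, h, w = out.shape
        crop = out[:, :, h // 4: 3 * h // 4, w // 4: 3 * w // 4]
        out = torch.nn.functional.interpolate(
            crop.permute(1, 0, 2, 3), size=(h, w), mode="bilinear",
            align_corners=False).permute(1, 0, 2, 3)
    if label[2]:  # permute the color channels (stand-in for channel split)
        out = out[[2, 0, 1]]
    return out, label

clip = torch.rand(3, 16, 64, 64)
aug, y = transform_clip(clip)
print(aug.shape, y)  # transformed clip and its multi-hot transformation label
```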
Action recognition is a challenging task that requires modeling both spatial and temporal context. Numerous works focus on architectures and modalities and have made worthwhile progress on this task. Owing to the redundancy in time and the limits of computational resources, several works instead study efficiency, such as frame sampling, some for untrimmed videos and some for trimmed videos. To improve the effectiveness of action recognition, we propose a novel Computational Spatiotemporal Selector (CSS) to refine and reinforce the key frames carrying discriminative information in a video. Specifically, CSS includes two modules: a Temporal Adaptive Sampling (TAS) module and a Spatial Frame Resolution (SFR) module. The former refines the key frames in the temporal dimension to capture key motion information, while the latter further scales down some of the refined frames in the spatial dimension to eliminate discrimination-irrelevant structural information. The proposed CSS can be flexibly embedded into most representative action recognition models. Experiments on two challenging action recognition benchmarks, ActivityNet1.3 and UCF101, show that the proposed CSS improves performance over most existing models, not only on trimmed videos but also on untrimmed videos.
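The abstract does not describe how TAS scores frames; as a simple stand-in, the sketch below selects the k frames with the largest frame-difference energy, a crude motion-saliency proxy rather than the paper's learned selector, assuming NumPy.

```python
import numpy as np

def sample_key_frames(frames, k=8):
    """frames: (T, H, W) grayscale video. Select the k frames with the largest
    frame-difference energy as a crude proxy for motion saliency."""
    diffs = np.abs(np.diff(frames.astype(float), axis=0)).mean(axis=(1, 2))
    scores = np.concatenate([[0.0], diffs])     # frame 0 has no predecessor
    keep = np.sort(np.argsort(scores)[-k:])     # top-k indices, in temporal order
    return frames[keep], keep

# Example: a synthetic 64-frame clip.
video = (np.random.rand(64, 112, 112) * 255).astype(np.uint8)
key_frames, indices = sample_key_frames(video, k=8)
print(key_frames.shape, indices)
```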
Computational performance with high-dimensional data is a common challenge for real-world action classification systems. Subspace learning, and manifold learning in particular, has received considerable attention as a means of finding efficient low-dimensional representations that lead to better classification and more efficient processing. A Grassmann manifold is a smooth space whose points represent subspaces. In this paper, Grassmannian Spectral Regression (GRASP) is presented as a Grassmann-inspired subspace learning algorithm that combines the benefits of Grassmann manifolds and spectral regression for fast and accurate classification. GRASP embeds high-dimensional action subspaces as individual points on a Grassmann manifold, kernelizes the embeddings onto a projection space, and then applies spectral regression for fast and accurate action classification. Furthermore, spatio-temporal action descriptors called Motion History Surfaces and Motion Depth Surfaces are utilized. The effectiveness of GRASP is illustrated on computationally intensive, multi-view, and 3D action classification datasets.