Self-Supervised Multi-Label Transformation Prediction for Video Representation Learning

DOI: https://doi.org/10.1142/S0218126622501596 | Cited by: 7 (Source: Crossref)

Self-supervised learning is a promising paradigm for reducing the cost of manual annotation by effectively leveraging unlabeled videos. By solving self-supervised pretext tasks, powerful video representations can be discovered automatically. However, recent pretext tasks for videos exploit only the temporal properties of videos, ignoring crucial supervisory signals from the spatial subspace. We therefore present a new self-supervised pretext task, Multi-Label Transformation Prediction (MLTP), that fully utilizes the spatiotemporal information in videos. In MLTP, all videos are jointly transformed by a set of geometric and color-space transformations, such as rotation, cropping, and color-channel split. We formulate this pretext task as multi-label prediction: a 3D-CNN is trained to predict the composition of underlying transformations as multiple outputs. Thereby, transformation-invariant video features can be learned in a self-supervised manner. Experimental results verify that 3D-CNNs pre-trained with MLTP yield video representations with improved generalization on the UCF101 (+2.4%) and HMDB51 (+7.8%) action recognition downstream tasks.
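To make the objective concrete, below is a minimal, hypothetical PyTorch sketch of one training step in the spirit of MLTP. The transformation set, the toy backbone (TinyC3D), and all hyperparameters are illustrative assumptions, not the authors' implementation: each clip is altered by a random subset of transformations, and the network predicts which ones were applied as independent binary labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of the MLTP objective (not the authors' exact code).
# Each clip is transformed by a random subset of K transformations; a 3D-CNN
# predicts, as K independent binary labels, which transformations were applied.

K = 3  # assumed transformation set: [rotation, spatial crop, channel split]

def apply_transforms(clips, labels):
    """clips: (N, C, T, H, W) with H == W; labels: (N, K) in {0, 1}."""
    out = []
    for clip, lab in zip(clips, labels):
        if lab[0]:  # 90-degree spatial rotation (needs square frames)
            clip = torch.rot90(clip, k=1, dims=(2, 3))
        if lab[1]:  # central crop, then resize back to the original size
            c, t, h, w = clip.shape
            crop = clip[:, :, h // 4: 3 * h // 4, w // 4: 3 * w // 4]
            clip = F.interpolate(crop.unsqueeze(0), size=(t, h, w),
                                 mode="trilinear", align_corners=False)[0]
        if lab[2]:  # stand-in for color-channel split: keep and replicate one channel
            clip = clip[0:1].repeat(3, 1, 1, 1)
        out.append(clip)
    return torch.stack(out)

class TinyC3D(nn.Module):
    """Stand-in 3D-CNN backbone with a K-way multi-label head."""
    def __init__(self, k=K):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(16, k)  # one logit per transformation

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

# One self-supervised training step: sample labels, transform, predict.
model = TinyC3D()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
clips = torch.rand(4, 3, 8, 32, 32)           # dummy unlabeled video clips
labels = torch.randint(0, 2, (4, K)).float()  # which transforms to apply
opt.zero_grad()
logits = model(apply_transforms(clips, labels))
loss = F.binary_cross_entropy_with_logits(logits, labels)  # multi-label loss
loss.backward()
opt.step()
```

A binary cross-entropy loss over the K logits is what makes the task multi-label rather than multi-class: any subset of transformations may be active at once, so each output is an independent yes/no prediction.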

This paper was recommended by Regional Editor Tongquan Wei.