Abstract
In this paper, we introduce a novel self-supervised visual representation learning method that understands both images and videos in a joint learning fashion. The proposed neural network architecture and objectives are designed to obtain two different convolutional neural networks for solving visual recognition tasks in the video and image domains. Our method, called Video/Image for Visual Contrastive Learning of Representation (Vi2CLR), uses unlabeled videos to exploit dynamic and static visual cues for self-supervised instance similarity/dissimilarity learning. The Vi2CLR optimization pipeline consists of a visual clustering stage and representation learning based on groups of similar positive instances within a cluster and negative instances from other clusters, jointly learning the visual clusters and the distances between them. We show how joint self-supervised visual clustering and instance similarity learning with 2D (image) and 3D (video) ConvNet encoders yields robust, near-supervised performance. We extensively evaluate the method on downstream tasks such as large-scale action recognition and image and object classification on datasets including Kinetics, ImageNet, Pascal VOC'07 and UCF101, and achieve outstanding results compared to state-of-the-art self-supervised methods.
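The abstract describes a contrastive objective in which instances grouped into the same visual cluster act as positives and instances from other clusters act as negatives. The sketch below is an illustrative, simplified version of such a cluster-based contrastive loss (an InfoNCE-style formulation), not the paper's actual Vi2CLR objective; the function name, temperature value, and NumPy implementation are assumptions for illustration.

```python
import numpy as np

def cluster_contrastive_loss(embeddings, cluster_ids, temperature=0.1):
    """Illustrative cluster-based contrastive loss (not the exact Vi2CLR loss):
    instances sharing a cluster id are positives; all others are negatives."""
    # L2-normalize so dot products are cosine similarities
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(cluster_ids)
    losses = []
    for i in range(n):
        # positives: other instances assigned to the same cluster as i
        pos = [j for j in range(n) if j != i and cluster_ids[j] == cluster_ids[i]]
        if not pos:
            continue
        logits = np.delete(sim[i], i)  # exclude self-similarity from the softmax
        log_denominator = np.log(np.exp(logits).sum())
        # map positive indices into the self-excluded logits array
        idx = [j if j < i else j - 1 for j in pos]
        # average negative log-softmax over instance i's positive pairs
        losses.append(np.mean(log_denominator - logits[idx]))
    return float(np.mean(losses))
```

Under this formulation, embeddings that are tight within their assigned clusters and far from other clusters yield a lower loss, which is the behavior the joint clustering/similarity pipeline optimizes for.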
Publication status
published

Book title
2021 IEEE/CVF International Conference on Computer Vision (ICCV)

Publisher
IEEE

Subject
Video analysis and understanding; Recognition and classification; Representation learning

ETH Bibliography
yes