Attention-only Transformers [34] have been applied to both Natural Language Processing (NLP) and Computer Vision (CV) tasks. One Transformer architecture developed specifically for CV is the Vision Transformer (ViT) [15]. ViT models have since been applied to numerous CV problems; one of particular interest is the pose estimation of a human subject. We present our modified ViT model, Un-TraPEs (UNsupervised TRAnsformer for Pose Estimation), which reconstructs a subject's pose from a monocular image and an estimated depth map. We compare the results obtained with this model against a ResNet [17] trained from scratch and a ViT fine-tuned to the task, and show promising results.
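The abstract does not disclose the internals of Un-TraPEs. As a purely illustrative sketch, the following PyTorch snippet shows one way a ViT-style encoder could map a monocular RGB image plus an estimated depth map to 3D joint coordinates; all module names, hyperparameters, and the joint count are hypothetical assumptions, not the authors' architecture.

```python
# Illustrative sketch only: not the Un-TraPEs architecture described in the paper.
# A ViT-style encoder over concatenated RGB-D patches with a pose regression head.
import torch
import torch.nn as nn


class ViTPoseRegressor(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=256, depth=6, heads=8, num_joints=17):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # 4 input channels: RGB plus the estimated depth map.
        self.patch_embed = nn.Conv2d(4, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Regression head: mean-pooled tokens -> (num_joints, 3) coordinates.
        self.head = nn.Linear(dim, num_joints * 3)
        self.num_joints = num_joints

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)                    # (B, 4, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)    # (B, N, dim)
        x = self.encoder(x + self.pos_embed)
        x = x.mean(dim=1)                                     # global average over tokens
        return self.head(x).view(-1, self.num_joints, 3)


# Usage with random tensors standing in for an RGB image and its estimated depth.
model = ViTPoseRegressor()
rgb = torch.randn(2, 3, 224, 224)
depth = torch.randn(2, 1, 224, 224)
pose = model(rgb, depth)   # shape: (2, 17, 3)
```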
Publication details
2023, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13589, pp. 3-20
Unsupervised Pose Estimation by Means of an Innovative Vision Transformer (04b Conference paper in proceedings volume)
Brandizzi N., Fanti A., Gallotta R., Russo S., Iocchi L., Nardi D., Napoli C.
ISBN: 978-3-031-23479-8; 978-3-031-23480-4