HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained)

Deep Learning Explainer · YouTube · 43 HN points · 0 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention Deep Learning Explainer's video "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained)".
YouTube Summary
This paper applies a pure transformer-based model (the Vision Transformer) to a sequence of image patches for image recognition. It shows that when the Transformer is pre-trained on a large dataset, it starts outperforming CNN-based models and achieves state-of-the-art results on multiple image classification datasets. More importantly, the Vision Transformer takes much less time to pre-train than comparable models.

0:00 - How many words is an image worth
2:17 - What's special about this paper
4:32 - Self-attention to images
7:05 - How it works
8:06 - Vision Transformer (ViT)
10:30 - Patch embedding
15:50 - [class] token
17:04 - Positional embedding
23:10 - Different ways to embed position info
25:49 - Model architecture
28:35 - Hybrid architecture
30:28 - Pre-training & fine-tuning
31:37 - Fine-tuning on higher resolution images
34:20 - Datasets
34:27 - Model variants
34:55 - Comparison to state-of-the-art
39:05 - Model size vs. data size
43:36 - Scaling study
46:32 - Attention heads
47:47 - Attention distance over layers
48:11 - Attention pattern analysis
51:14 - Self-supervised pre-training
54:16 - Summary
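
For readers who want to map the chapters above onto code, here is a minimal PyTorch sketch of the pipeline the video walks through: patch embedding, a learnable [class] token, positional embeddings, a Transformer encoder, and a classification head. This is an illustration rather than the paper's exact model (for example, nn.TransformerEncoderLayer defaults to post-norm and ReLU, while ViT uses pre-norm and GELU).

import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    # Defaults loosely follow ViT-Base (dim=768, depth=12, heads=12)
    def __init__(self, image_size=224, patch_size=16, num_classes=1000,
                 dim=768, depth=12, heads=12):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_size = patch_size
        # Patch embedding: linear projection of each flattened p x p patch
        self.to_embedding = nn.Linear(3 * patch_size * patch_size, dim)
        # Learnable [class] token and 1D positional embeddings
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img):                                    # img: (B, 3, H, W)
        p = self.patch_size
        B, C, H, W = img.shape
        # Break the image into a sequence of flattened p x p patches
        patches = img.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        x = self.to_embedding(patches)                         # (B, N, dim)
        x = torch.cat([self.cls_token.expand(B, -1, -1), x], dim=1)
        x = x + self.pos_embedding                             # add position info
        x = self.encoder(x)
        return self.head(x[:, 0])                              # classify on the [class] token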

Related videos
Transformer Architecture Explained
https://youtu.be/ELTGIye424E

Quantifying Attention Flow In Transformers
https://youtu.be/3Q0ZXqVaQPo

Paper
https://openreview.net/pdf?id=YicbFdNTTy

Code
https://github.com/lucidrains/vit-pytorch
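
At the time of writing, the repo's README shows usage roughly along these lines (the argument names are the repo's own and may change between versions):

import torch
from vit_pytorch import ViT

v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 256, 256)
preds = v(img)  # (1, 1000) class logits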

Abstract
While the Transformer architecture has become the de facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data and transferred to multiple recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), the Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Oct 12, 2020 · 43 points, 11 comments · submitted by deeplstm
ycombinatorrio
Another video that also explains the intuition behind this paper quite well, from Yannic Kilcher: https://www.youtube.com/watch?v=TrdevFK_am4
deeplstm
good to know!
ximeng
Link to paper for convenience: https://openreview.net/pdf?id=YicbFdNTTy
deeplstm
thanks for adding it!
whoisnnamdi
Thanks for this super detailed video! I've seen some of your others in the past and always found them helpful. Yannic's video on this paper is great as well.

If anyone is looking for a written walkthrough, I did a quick summary/explainer on my personal site: https://whoisnnamdi.com/transformers-image-recognition/

deeplstm
Thanks for watching! I am glad you find them helpful. And your written summary looks nice!
pistachiopro
It's still an interesting paper, but I was disappointed they were "just" concatenating an image generator with a language model. I'm really excited for when someone figures out concurrently trained models, say, alternating between training passes of GPT-3 and iGPT, such that the very same attention layers deal with both language and visual/spatial conceptualization. I expect common sense reasoning capabilities to take a huge leap at that point.
deeplstm
Training on language and visual cues at the same time is indeed the next important milestone to achieve.
ipunchghosts
All these transformer explanations are missing a level of detail.

I wish someone would code this model and walk through it along with the training and inference procedures.

deeplstm
I did skip many details in this video, as the Transformer architecture is itself a big topic. I would suggest breaking this paper down into two components:

1. The Transformer
2. How to apply the Transformer to image data

As for the first one, I made a video that goes through the Transformer architecture, if you're interested: https://youtu.be/ELTGIye424E

For the second one, what you need to code besides the Transformer itself is a pre-processing pipeline that breaks an image into a sequence of patches.
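
A minimal sketch of that pre-processing step, assuming PyTorch plus einops (the same reshaping trick the vit-pytorch repo linked above uses); the shapes here are illustrative:

import torch
from einops import rearrange

imgs = torch.randn(8, 3, 224, 224)   # (batch, channels, height, width)
p = 16                               # 16x16 patches, as in the paper's title
# Split each spatial axis into (grid x patch) and flatten every patch
# into one "word" of length p * p * 3 = 768
patches = rearrange(imgs, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p)
print(patches.shape)                 # torch.Size([8, 196, 768]): 196 patch tokens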

deeplstm
I spent the weekend reading this super interesting paper, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", and made this video to explain it a bit. I hope it can be helpful for those who are also interested in how the Transformer architecture (the de facto model in NLP) can be applied in computer vision.
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.