HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained)

Deep Learning Explainer · YouTube · 43 HN points · 0 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention Deep Learning Explainer's video "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained)".
YouTube Summary
This paper applies a pure transformer-based model (the Vision Transformer) to a sequence of image patches for image recognition. It shows that when the Transformer is pre-trained on a large dataset, it starts outperforming CNN-based models and achieves state-of-the-art results on multiple image classification datasets. More importantly, the Vision Transformer takes much less time to pre-train than comparable models.

0:00 - How many words is an image worth
2:17 - What's special about this paper
4:32 - Self-attention to images
7:05 - How it works
8:06 - Vision Transformer (ViT)
10:30 - Patch embedding
15:50 - [class] token
17:04 - Positional embedding
23:10 - Different ways to embed position info
25:49 - Model architecture
28:35 - Hybrid architecture
30:28 - Pre-training & fine-tuning
31:37 - Fine-tuning on higher resolution images
34:20 - Datasets
34:27 - Model variants
34:55 - Comparison to state-of-the-art
39:05 - Model size vs. data size
43:36 - Scaling study
46:32 - Attention heads
47:47 - Attention distance over layers
48:11 - Attention pattern analysis
51:14 - Self-supervised pre-training
54:16 - Summary
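
For readers who want to map the chapters above onto code, here is a minimal PyTorch sketch of the pipeline the video walks through: patch embedding, a learnable [class] token, positional embeddings, a Transformer encoder, and a classification head. This is an illustration rather than the paper's exact model (for example, nn.TransformerEncoderLayer defaults to post-norm and ReLU, while ViT uses pre-norm and GELU).

import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    # Defaults loosely follow ViT-Base (dim=768, depth=12, heads=12)
    def __init__(self, image_size=224, patch_size=16, num_classes=1000,
                 dim=768, depth=12, heads=12):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_size = patch_size
        # Patch embedding: linear projection of each flattened p x p patch
        self.to_embedding = nn.Linear(3 * patch_size * patch_size, dim)
        # Learnable [class] token and 1D positional embeddings
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img):                                    # img: (B, 3, H, W)
        p = self.patch_size
        B, C, H, W = img.shape
        # Break the image into a sequence of flattened p x p patches
        patches = img.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        x = self.to_embedding(patches)                         # (B, N, dim)
        x = torch.cat([self.cls_token.expand(B, -1, -1), x], dim=1)
        x = x + self.pos_embedding                             # add position info
        x = self.encoder(x)
        return self.head(x[:, 0])                              # classify on the [class] token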

Related videos
Transformer Architecture Explained
https://youtu.be/ELTGIye424E

Quantifying Attention Flow In Transformers
https://youtu.be/3Q0ZXqVaQPo

Paper
https://openreview.net/pdf?id=YicbFdNTTy

Code
https://github.com/lucidrains/vit-pytorch
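
At the time of writing, the repo's README shows usage roughly along these lines (the argument names are the repo's own and may change between versions):

import torch
from vit_pytorch import ViT

v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 256, 256)
preds = v(img)  # (1, 1000) class logits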

Abstract
While the Transformer architecture has become the de facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data and transferred to multiple recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), the Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Oct 12, 2020 · 43 points, 11 comments · submitted by deeplstm
ycombinatorrio
Another video that also explains the intuition behind this paper quite well, from Yannic Kilcher: https://www.youtube.com/watch?v=TrdevFK_am4
deeplstm
good to know!
ximeng
Link to paper for convenience: https://openreview.net/pdf?id=YicbFdNTTy
deeplstm
thanks for adding it!
whoisnnamdi
Thanks for this super detailed video! I've seen some of your others in the past and always found them helpful. Yannic's video on this paper is great as well.

If anyone is looking for a written walkthrough, I did a quick summary/explainer on my personal site: https://whoisnnamdi.com/transformers-image-recognition/

deeplstm
Thanks for watching! I am glad you find them helpful. And your written summary looks nice!
pistachiopro
It's still an interesting paper, but I was disappointed they were "just" concatenating an image generator with a language model. I'm really excited for when someone figures out concurrently trained models, say, alternating between training passes of GPT-3 and iGPT, such that the very same attention layers deal with both language and visual/spatial conceptualization. I expect common sense reasoning capabilities to take a huge leap at that point.
deeplstm
Training on language and visual cues at the same time is indeed the next important milestone to achieve.
ipunchghosts
All these transformer explanations are missing a level of detail.

I wish someone would code this model and walk through it along with the training and inference procedures.

deeplstm
I did skip many details in this video, as the Transformer architecture is itself a big topic. I would suggest breaking this paper down into two components:

1. The Transformer
2. How to apply the Transformer to image data

As for the first one, I made a video that goes through the Transformer architecture, if you're interested: https://youtu.be/ELTGIye424E

For the second one, what you need to code besides the Transformer itself is a pre-processing pipeline that breaks an image into a sequence of patches.
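
A minimal sketch of that pre-processing step, assuming PyTorch plus einops (the same reshaping trick the vit-pytorch repo linked above uses); the shapes here are illustrative:

import torch
from einops import rearrange

imgs = torch.randn(8, 3, 224, 224)   # (batch, channels, height, width)
p = 16                               # 16x16 patches, as in the paper's title
# Split each spatial axis into (grid x patch) and flatten every patch
# into one "word" of length p * p * 3 = 768
patches = rearrange(imgs, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p)
print(patches.shape)                 # torch.Size([8, 196, 768]): 196 patch tokens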

deeplstm
I spent the weekend reading this super interesting paper, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", and made this video to explain it a bit. I hope it can be helpful for those who are also interested in how the Transformer architecture (the de facto model in NLP) can be applied in computer vision.
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.