Hacker News Comments on
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained)
Deep Learning Explainer · YouTube · 43 HN points · 0 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention Deep Learning Explainer's video "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained)".
YouTube Summary
HN Theater Rankings
- This video is unranked
Hacker News Stories and Comments
All the comments and stories posted to Hacker News that reference this video.
⬐ ycombinatorrio: Another video which also explains the intuition behind this paper quite well, from Yannick Kilcher: https://www.youtube.com/watch?v=TrdevFK_am4
⬐ deeplstm: Good to know!
⬐ ximeng: Link to paper for convenience: https://openreview.net/pdf?id=YicbFdNTTy
⬐ deeplstm: Thanks for adding it!
⬐ whoisnnamdi: Thanks for this super detailed video! I've seen some of your others in the past and always found them helpful. Yannick's video on this paper is great as well. If anyone is looking for a written walkthrough, I did a quick summary/explainer on my personal site: https://whoisnnamdi.com/transformers-image-recognition/
⬐ deeplstm: Thanks for watching! I'm glad you find them helpful. And your written summary looks nice!
⬐ pistachiopro: It's still an interesting paper, but I was disappointed they were "just" concatenating an image generator with a language model. I'm really excited for when someone figures out concurrently trained models, say, alternating between training passes of GPT-3 and iGPT, such that the very same attention layers deal with both language and visual/spatial conceptualization. I expect common-sense reasoning capabilities to take a huge leap at that point.
⬐ deeplstm: Training on language and visual cues at the same time is indeed the next important milestone to achieve.
⬐ ipunchghosts: All these transformer explanations are missing a level of detail. I wish someone would code this model and walk through it along with the training and inference procedures.
⬐ deeplstm: I did skip many details in this video, as the Transformer architecture is itself a big topic. I would suggest breaking this paper down into two components: 1. the Transformer itself, and 2. how to apply the Transformer to image data.
As for the first one, I made a video walking through the Transformer architecture, if you're interested: https://youtu.be/ELTGIye424E
For the second one, what you need to code besides the Transformer is a pre-processing pipeline that breaks an image into a sequence of patches.
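For readers who want to see those two components in code, below is a minimal PyTorch sketch: a patch-embedding pipeline that turns an image into a sequence of 16x16 patch tokens, plus a standard Transformer encoder with a [CLS] token and learned position embeddings, roughly following the paper's recipe. This is illustrative only, not the authors' released implementation; the class names (PatchEmbedding, TinyViT) and the hyperparameters (224px inputs, 768-dim embeddings, 12 layers, 12 heads) are assumptions chosen to resemble the base configuration.

```python
# Minimal sketch (not official code): patchify + Transformer encoder for image classification.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly project each one.
    A Conv2d with kernel_size == stride == patch_size is equivalent to flattening
    every patch and applying a shared linear projection."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2            # 14 * 14 = 196 for 224px
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)     # (B, 196, 768) -- a sequence of patch tokens

class TinyViT(nn.Module):
    """Patch embedding + [CLS] token + learned position embeddings + Transformer encoder;
    the class prediction is read off the [CLS] token."""
    def __init__(self, num_classes=1000, embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.patch_embed.num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           activation="gelu", batch_first=True,
                                           norm_first=True)      # pre-norm blocks, as in ViT
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x)                              # (B, 196, 768)
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)      # (B, 1, 768)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed # prepend [CLS], add positions
        tokens = self.encoder(tokens)
        return self.head(self.norm(tokens[:, 0]))                 # classify from the [CLS] token

# Usage: logits = TinyViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```

Real implementations also use random (e.g. truncated-normal) initialization for the [CLS] and position parameters and add dropout; the zeros above are just to keep the sketch short.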
I spent the weekend reading this super interesting paper, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", and made this video to explain it a bit. I hope it can be helpful for those who are also interested in how the Transformer architecture (the de-facto model in NLP) can be applied to computer vision.