HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
A Crash Course in Modern Hardware

Cliff Click · InfoQ · 145 HN points · 10 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention Cliff Click's video "A Crash Course in Modern Hardware".
Watch on InfoQ [↗]
InfoQ Summary
Cliff Click discusses the Von Neumann architecture, CISC vs RISC, Instruction-Level Parallelism, pipelining, out-of-order dispatch, cache misses, memory performance, and tips to improve performance.
HN Theater Rankings

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Feb 04, 2022 · 86 points, 17 comments · submitted by Tomte
alblue
You might also like my recent presentation on understanding micro architecture:

https://speakerdeck.com/alblue/understanding-cpu-microarchit...

The presentation was recorded and is on YouTube

https://youtu.be/Pa_l3aHCoGc

dang
A couple past small threads:

A Crash Course in Modern Hardware - https://news.ycombinator.com/item?id=3467493 - Jan 2012 (2 comments)

A Crash Course in Modern Hardware - https://news.ycombinator.com/item?id=1394966 - June 2010 (9 comments)

mdaniel
> Speed of Light

> Takes more than a clock cycle for signal to propagate across a complex CPU

Wowers, I had never considered that

Koshkin
Well, that has little to do with the speed of light, though; it has more to do with delays from parasitic capacitances and the slow motion of the charges that form the conducting channels inside transistors.
dreamcompiler
Light travels 1 foot per nanosecond in a vacuum. (Electricity in wires is slower.)

A 1 GHz processor has a 1 ns cycle time. So yeah, with multi-GHz clocks the speed of light certainly does matter a lot, and it's one reason why keeping everything on the same chip (whenever possible) is important.

Parasitic capacitance and inductance and carrier transport speed are important too, but it's not correct to state "it has little to do with the speed of light."
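
(A rough back-of-the-envelope check using the figures above: at 3 GHz a clock cycle lasts about 0.33 ns, in which light covers only about 10 cm in a vacuum, and on-chip signals propagate slower than that, so a signal crossing a large die really can take more than a cycle.)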

zekrioca
1 foot per ns.. that really confuses me.. Why not something simpler such as 300000 km/s? :)
amelius
Lesson 1: it's almost all locked down by the vendor. Even the development tools.
zqfm
Cool! Anyone have resources on what has changed since then?
alcover
Fair warning: the video is of really poor quality. Blurry, the white-washed presentation screen is unreadable, and the camera follows the presenter instead of focusing on the screen.
stjohnswarts
To be more fair, there is a pretty crisp PowerPoint deck on the right, synced to the video, which enhances the presentation quite a bit.
aklemm
Can someone provide context for this?
rjsw
The presenters have done interesting stuff with Java on modern hardware, I would expect it to be good.
eternalban
Multi-core realities required teaching developers about the abstracted-away hardware. This talk is a continuation of the surfacing of hardware reality at the language and library levels.
jcranmer
This appears to cover, at a high level, roughly a "Computer Architecture 201" course: explaining pipelining and cache coherency, with discussion of the why and what (but not the how) of speculative execution, out-of-order execution, and branch prediction. If you have taken such a course before, this will likely be nothing new to you; if you haven't, it may be interesting.
hvs
Or if you took that class back in the mid-90's it might be interesting. ;)
commandlinefan
That was my thought - either this isn't a crash course in "modern" hardware, or hardware hasn't progressed much in the last 20 years.
jcranmer
At the high level that this is presenting, there really hasn't been any progress in computer architecture. That's not to say that there haven't been any improvements going on, but the improvements are more like "branch prediction is X% better" [1] or "we can issue an additional instruction per cycle", which don't have a major impact on the overall story presented here.

Nor have any alternative architectures really demonstrated themselves to be competitive. GPGPU programming has become a lot more salient, but GPGPU itself is largely the standard CPU programming model with speculative execution logic tuned way down and SMT and SIMD tuned way up (both of which would have been facets of modern hardware even at the time of this presentation). FPGAs have been "the next big thing" for, gosh, 30 years now, but they've remained relegated to niche roles.

[1] Indirect branch prediction in particular has progressed a lot even in the past decade.

Jun 01, 2021 · 3 points, 0 comments · submitted by Tomte
Feb 24, 2020 · 1 point, 0 comments · submitted by Tomte
Aug 22, 2019 · 2 points, 0 comments · submitted by Tomte
Mar 16, 2019 · 2 points, 0 comments · submitted by Tomte
Oct 24, 2018 · 2 points, 0 comments · submitted by Tomte
May 09, 2018 · 3 points, 0 comments · submitted by Tomte
Dec 06, 2017 · 4 points, 0 comments · submitted by rbanffy
Nov 10, 2017 · 3 points, 0 comments · submitted by Tomte
They do. Video here from 7 years ago that talks about it: https://www.infoq.com/presentations/click-crash-course-moder...

Basically, they do speculative execution with register renaming to get quick turn-around if the memory is available in cache.

It really is quite crazy how much faster the CPU is than memory, and what tricks it pulls to get around that problem.
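
For a hands-on feel for that gap, here is a minimal, hypothetical Java sketch (the array size and names are illustrative, not from the talk): it sums the same array through a sequential index and through a randomized index, so the second pass defeats the hardware prefetcher and is dominated by cache misses, typically running several times slower.

    import java.util.Random;

    public class CacheMissDemo {
        // ~64 MB of ints: far larger than typical last-level caches (illustrative size)
        static final int N = 1 << 24;

        public static void main(String[] args) {
            int[] data = new int[N];
            int[] seqIdx = new int[N];
            int[] rndIdx = new int[N];
            Random rng = new Random(42);
            for (int i = 0; i < N; i++) {
                data[i] = i;
                seqIdx[i] = i;                // sequential access: prefetcher-friendly
                rndIdx[i] = rng.nextInt(N);   // random access: mostly cache misses
            }
            for (int rep = 0; rep < 3; rep++) {   // repeat so the JIT warms up
                time("sequential", data, seqIdx);
                time("random    ", data, rndIdx);
            }
        }

        static void time(String label, int[] data, int[] idx) {
            long start = System.nanoTime();
            long sum = 0;
            for (int i = 0; i < idx.length; i++) {
                sum += data[idx[i]];          // same arithmetic, different memory pattern
            }
            System.out.printf("%s sum=%d %.1f ms%n", label, sum, (System.nanoTime() - start) / 1e6);
        }
    }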

Jul 02, 2017 · 4 points, 0 comments · submitted by Tomte
Mar 29, 2017 · 1 point, 0 comments · submitted by Tomte
Nov 18, 2016 · 3 points, 0 comments · submitted by Tomte
Anyone have the presentation from an Intel guy on how the CPU design focus has moved from cycles to cache misses handy?

Edit: never mind, it was not an Intel guy. And I actually had the thing bookmarked (and it still worked).

https://www.infoq.com/presentations/click-crash-course-moder...

vvanders
That's a great talk; it brings together a lot of different things I've seen in one place.
vcarl
There's a fantastic, massively upvoted StackOverflow post that can also provide some insight here. This may be a little more accessible, since it's such a significant runtime difference with very simple source code.

http://stackoverflow.com/questions/11227809/why-is-it-faster...
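
For anyone who would rather reproduce the effect locally, here is a small, hypothetical Java sketch in the spirit of that question (the sizes and the 128 threshold are illustrative): the same data-dependent branch is timed over unsorted and sorted copies of the data, and the sorted pass is usually much faster because the branch becomes predictable (though a sufficiently clever JIT may emit a conditional move and erase the difference).

    import java.util.Arrays;
    import java.util.Random;

    public class BranchPredictionDemo {
        public static void main(String[] args) {
            int[] data = new int[1 << 20];
            Random rng = new Random(1);
            for (int i = 0; i < data.length; i++) {
                data[i] = rng.nextInt(256);
            }
            int[] unsorted = data.clone();
            int[] sorted = data.clone();
            Arrays.sort(sorted);              // sorting makes the branch below predictable

            for (int rep = 0; rep < 5; rep++) {
                time("unsorted", unsorted);
                time("sorted  ", sorted);
            }
        }

        static void time(String label, int[] values) {
            long start = System.nanoTime();
            long sum = 0;
            for (int pass = 0; pass < 100; pass++) {
                for (int v : values) {
                    if (v >= 128) {           // ~50/50 branch on random data
                        sum += v;
                    }
                }
            }
            System.out.printf("%s sum=%d %.1f ms%n", label, sum, (System.nanoTime() - start) / 1e6);
        }
    }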

woliveirajr
The question is very interesting and well phrased, and the answer is better than many of the classes that students have had about processors and so on.
globuous
Thanks so much for sharing this, great read!

For those, like me, that want to play with what this stackoverflow talks about, here's a fiddle of it: https://jsfiddle.net/tbinetruy/Latkmk2q/1/ (code takes 2s to run and loads firebug for console logs).

Feb 09, 2016 · 3 points, 0 comments · submitted by Tomte
For the most part, manufacturing advances drive microarchitecture advances. Smaller feature sizes mean more transistors can be stuffed in the same area. Those transistors can be used to make larger reorder buffers, more registers, more caches, better branch predictors, and more functional units. If you want to know about specific architectural changes in x86 over the years, I strongly recommend Cliff Click's talk: A Crash Course in Modern Hardware.[1]

A lot of the specifics of semiconductor manufacturing are closely-guarded secrets, but Todd Fernandez gave a glimpse in an informal talk titled Inseparable From Magic: Manufacturing Modern Computer Chips.[2]

1. http://www.infoq.com/presentations/click-crash-course-modern... (starts about 4 minutes in)

2. https://www.youtube.com/watch?v=NGFhc8R_uO4

Jun 05, 2015 · 1 point, 0 comments · submitted by tambourine_man
Jun 05, 2015 · pron on Three months of Rust
I think the opposite is true. Modern hardware is so complex[1], made even more so by its constant interaction with a complex OS, that any sense of familiarity with the actual performance model is illusory, unless you're doing something very controlled and very specific (like, say, DSP). Modern hardware itself is an abstraction, hiding its operation away from you. We can no longer hope to tame hardware with meticulous control over instructions as we were able to up until the nineties.

Forget about clever compilers; forget even about smart JITs; even if you look at such a big abstraction as GCs and only consider large pauses (say anything over a few tens of milliseconds), it is now the case that in a well-tuned application using a good GC, most large pauses aren't even due to GC, but to the OS stopping your program to perform some bookkeeping. Careful control over the instruction stream doesn't even let you avoid 100ms pauses, let alone trying to control nanosecond-level effects.

[1]: http://www.infoq.com/presentations/click-crash-course-modern...

dly
And yet I consistently find that checksum tools, compression libraries, and things like video decoders (such as H264 decoders) written in assembly outperform all other implementations I've had to deal with. "Sufficiently smart compiler" is a tired meme at this point. There are few programs that benefit from being entirely written in assembly, but quite a lot that benefit from having parts of them hand-optimized. Some, like game emulators (particularly one-man projects like No$GBA), are still fully written in assembly, and their performance is a sight to behold. No$GBA would lose a lot if it were rewritten in a high-level language.
pron
> and things like video decoders

That's precisely the example I gave. Although many modern decoders use GPUs, which are much simpler than CPUs (simpler even than 90s era CPUs). The GPU performance model is very simple to comprehend.

> No$GBA would lose a lot if it were rewritten into a high level language.

That's a nice sentiment, but I don't think it is supported by the facts. You could probably write a JIT in Python that would perform much, much better (but that would be overkill, given that you're emulating a very slow, very small machine), and a trivial implementation in Java would probably perform just as well.

The ability to achieve significantly better performance for general-purpose tasks (let's call that "branchy code") with low-level languages today is more myth than reality. What is true is that some high-level languages consciously give up some performance to make development easier, but that's a design choice. That's not to say that optimizing JIT and AOT compilers get everything right -- they don't -- but they get it right often enough that they're very hard to beat.

jamii
Most of us don't have the time for meticulous control over instructions, but those who do can certainly use it to good effect, eg http://www.reddit.com/r/programming/comments/hkzg8/author_of...

My aversion to piles of opaque heuristics is not because I'm against smart compilers, just that for certain projects I want to form a mental model of what code I should write to get a certain effect. The trend of modern languages with heavy heuristic optimisations or complex JITs is towards less certainty and less stable optimisations, so that a program that runs fine today might be unusably slow tomorrow.

Staging and compiler-as-a-library is a promising compromise for projects that really care about stable performance eg http://data.epfl.ch/legobase . You can still have an LLVM-smart compiler underneath but you get to make the first pass.

Rust is actually very predictable in some respects eg generic functions will be monomorphised. I prefer it to wrangling GHC or the V8 JIT.

pron
> I want to form a mental model of what code I should write to get a certain effect

And how do you do that with hyperthreading, virtual memory, power management that may decide to power down your core because what you're doing doesn't seem important enough (and that differs greatly from one processor to another), and cache effects on code, data and TLB (all are strongly affected by other threads and processes running on your machine[1])?

While those effects didn't exist much before the 90s, and they don't exist today in GPUs and small embedded devices, on desktops and servers those effects may be much greater in magnitude than any difference you're able to get by better control over generated code. Not running a hypervisor, turning off virtual memory, pinning threads and isolating cores have a much more profound effect on predictability than which language or compiler you're using. Focusing on compilation before taking care of those much more powerful sources of unpredictability is like trying to get a faster car by reducing the weight of the upholstery fabric.

> so that a program that runs fine today might be unusably slow tomorrow.

I think that slowdown actually applies to assembly programs much more than to, say, Java. As CPU architecture changes, it's actually easier to keep higher-level code performant. I mean, why do you assume that compiler changes will hurt your code performance more than CPU changes?

> You can still have an LLVM-smart compiler underneath but you get to make the first pass.

There are many ways to produce good machine code (my favorite is Graal, HotSpot's next-gen JIT), but none of them really give you a good mental model of what's going on. You may like one approach over another for personal aesthetic reasons, one approach may actually produce better results for some workloads than others, and some approaches really are more predictable -- but no approach produces categorically predictable results, and more predictability doesn't buy you better performance (though it still requires more effort).

It used to be that if you knew what instructions your compiler would emit, you knew how your program would perform. That is just no longer the case (well, it is to some degree, but other effects are stronger). A single instruction may perform anywhere within 7 orders of magnitude (L1 cache hit to virtual memory miss) depending on effects outside the program's control! (of course, those high-volatility costs are usually amortized, but so is a less unpredictable compiler output).

[1]: That is the key to cryptographic attacks that let a process sense what a cryptography algorithm running in another process is doing by the way the cryptographic computation affects the performance of the first process.

jamii
I think you are taking a very black and white point of view. Yes, hardware is complex and unpredictable. That doesn't mean that we can't reason at all about performance.

I take a program, measure its performance on a wide range of real-world workloads across multiple different machines. Then I change some numeric routine to use unboxed integers instead of boxed integers. I measure it again on a wide range of real-world workloads across multiple different machines and find that it is significantly faster in all cases. My approximate mental model of how the machine works allowed me to make a change that empirically improved performance. My model is not perfect, so I do have to measure carefully, but it is what allows me to make sensible decisions about which changes to measure rather than just changing things at random.

In a language where the compiler controls unboxing, my mental model is much more approximate. I have to figure out how to influence the heuristics to lead them into making the correct choice, and the solutions tend to be hacks that are highly sensitive to small changes to the heuristics, leading to conversations like https://groups.google.com/forum/#!topic/clojure/GvNLOrN3lGA .

Performance for non-tuned code may be better on average, but my ability to tune important areas is reduced. If the compiler was more predictable, or had an interface that allowed me to add information, or if I could make my own passes, then that trade-off would go away. I'm not against smart compilers, I'm against smart compilers that don't talk to me.
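
As a concrete (and hypothetical) illustration of the kind of change described above, the sketch below sums the same numbers through boxed Integer objects and through a primitive int array; the primitive version avoids a pointer dereference and an unboxing step per element and is typically several times faster, which is exactly the sort of effect a rough mental model of the machine predicts.

    public class BoxingDemo {
        static final int N = 10_000_000;       // illustrative size

        public static void main(String[] args) {
            Integer[] boxed = new Integer[N];
            int[] unboxed = new int[N];
            for (int i = 0; i < N; i++) {
                boxed[i] = i;                  // a heap object per element (beyond the small-value cache)
                unboxed[i] = i;                // a plain machine word per element
            }
            for (int rep = 0; rep < 5; rep++) {
                long t0 = System.nanoTime();
                long boxedSum = 0;
                for (Integer v : boxed) boxedSum += v;     // unbox + pointer chase per element
                long t1 = System.nanoTime();
                long unboxedSum = 0;
                for (int v : unboxed) unboxedSum += v;     // straight-line primitive loop
                long t2 = System.nanoTime();
                System.out.printf("boxed %.1f ms, unboxed %.1f ms (sums %d / %d)%n",
                        (t1 - t0) / 1e6, (t2 - t1) / 1e6, boxedSum, unboxedSum);
            }
        }
    }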

pron
> I'm not against smart compilers, I'm against smart compilers that don't talk to me.

There are some extremely interesting advances in that area in OpenJDK. Java 9 will contain two relevant changes. The first, JEP 165[1] (fine-grained and method-context dependent control of the JVM compilers), lets you control compilation with metadata depending on context (e.g. inline method foo when called from bar); a much more interesting and powerful enhancement targeted for Java 9 is JEP 243[2] (Java-Level JVM Compiler Interface). It will do the following:

* Allow the JVM to load Java plug-in code to examine and intercept JVM JIT activity.

* Record events related to compilation, including counter overflow, compilation requests, speculation failure, and deoptimization.

* Allow queries to relevant metadata, including loaded classes, method definitions, profile data, dependencies (speculative assertions), and compiled code cache.

* Allow an external module to capture compilation requests and produce code to be used for compiled methods.

This opens the door to what I think is the most impressive compiler of the last decade, and a true breakthrough in (JIT) compiler design: Graal[3]. Graal supports languages of any level (it already has frontends for Java, C, Ruby, Python, R and JavaScript), and then allows complete control over code generation and optimization decisions at runtime. E.g. you tell it what kind of speculations to make, and it tells you which speculations failed. Unlike LLVM, you compile your language into a semantic AST (that may or may not match the language's AST) and feed it to Graal, but each node may contain not just semantics but instructions on speculation and code-gen control at any level you wish. During compilation, Graal interacts with the node and the node gives further instructions. As I understand it, JEP 243 will allow plugging Graal into the standard OpenJDK HotSpot (though at reduced speed), until Graal matures enough to become HotSpot's default compiler.

So what Graal will do is let the developer (if the language designer allows), write simple, high-level code, but tell the compiler, "listen, compile however you like, but when you get to this function, talk to me because I have some ideas on how to compile it just right".

[1]: http://openjdk.java.net/jeps/165

[2]: http://openjdk.java.net/jeps/243

[3]: https://wiki.openjdk.java.net/display/Graal/Publications+and...

jamii
Thanks, that is really interesting. I'll have to look into it.
Those of you wanting to know more about this may be interested in Cliff Click's Crash Course in Modern Hardware.[1] It does a pretty good job of explaining how pipelined, superscalar, OoO CPUs came to be.

1. http://www.infoq.com/presentations/click-crash-course-modern...

No argument there.

BTW, for anyone interested in an overview of the non-determinism built into modern hardware architectures, I recommend watching this great talk[1] -- A Crash Course in Modern Hardware -- by Cliff Click, one of the world's top JIT experts.

[1]: http://www.infoq.com/presentations/click-crash-course-modern...

May 13, 2014 · chroma on Computers are fast
For an in-depth presentation on how we got to this point (cache misses dominating performance), there's an informative and interesting talk by Cliff Click called A Crash Course in Modern Hardware: http://www.infoq.com/presentations/click-crash-course-modern...

The talk starts just after 4 minutes in.

May 06, 2014 · 14 points, 0 comments · submitted by signa11
> In short, OOO cores are weird and horribly complicated and completely untrustworthy where performance is concerned.

Yep. There's a great talk about this by Cliff Click, called A Crash Course in Modern Hardware[1] that I would recommend to everyone. I am no hardware expert, so that talk really enlightened me.

Regarding the issue at hand, I remember Doug Lea saying[2] that some new Intel processors may recognize a loop as the OS's idle loop and power down the core. That's why he computes random numbers in busy-wait loops.

[1] http://www.infoq.com/presentations/click-crash-course-modern...

[2] http://emergingtech.chariotsolutions.com/2013/04/phillyete-s...

Summary of the links shared here:

http://blip.tv/clojure/michael-fogus-the-macronomicon-597023...

http://blog.fogus.me/2011/11/15/the-macronomicon-slides/

http://boingboing.net/2011/12/28/linguistics-turing-complete...

http://businessofsoftware.org/2010/06/don-norman-at-business...

http://channel9.msdn.com/Events/GoingNative/GoingNative-2012...

http://channel9.msdn.com/Shows/Going+Deep/Expert-to-Expert-R...

http://en.wikipedia.org/wiki/Leonard_Susskind

http://en.wikipedia.org/wiki/Sketchpad

http://en.wikipedia.org/wiki/The_Mother_of_All_Demos

http://io9.com/watch-a-series-of-seven-brilliant-lectures-by...

http://libarynth.org/selfgol

http://mollyrocket.com/9438

https://github.com/PharkMillups/killer-talks

http://skillsmatter.com/podcast/java-jee/radical-simplicity/...

http://stufftohelpyouout.blogspot.com/2009/07/great-talk-on-...

https://www.destroyallsoftware.com/talks/wat

https://www.youtube.com/watch?v=0JXhJyTo5V8

https://www.youtube.com/watch?v=0SARbwvhupQ

https://www.youtube.com/watch?v=3kEfedtQVOY

https://www.youtube.com/watch?v=bx3KuE7UjGA

https://www.youtube.com/watch?v=EGeN2IC7N0Q

https://www.youtube.com/watch?v=o9pEzgHorH0

https://www.youtube.com/watch?v=oKg1hTOQXoY

https://www.youtube.com/watch?v=RlkCdM_f3p4

https://www.youtube.com/watch?v=TgmA48fILq8

https://www.youtube.com/watch?v=yL_-1d9OSdk

https://www.youtube.com/watch?v=ZTC_RxWN_xo

http://vimeo.com/10260548

http://vimeo.com/36579366

http://vimeo.com/5047563

http://vimeo.com/7088524

http://vimeo.com/9270320

http://vpri.org/html/writings.php

http://www.confreaks.com/videos/1071-cascadiaruby2012-therap...

http://www.confreaks.com/videos/759-rubymidwest2011-keynote-...

http://www.dailymotion.com/video/xf88b5_jean-pierre-serre-wr...

http://www.infoq.com/presentations/Are-We-There-Yet-Rich-Hic...

http://www.infoq.com/presentations/click-crash-course-modern...

http://www.infoq.com/presentations/miniKanren

http://www.infoq.com/presentations/Simple-Made-Easy

http://www.infoq.com/presentations/Thinking-Parallel-Program...

http://www.infoq.com/presentations/Value-Identity-State-Rich...

http://www.infoq.com/presentations/We-Really-Dont-Know-How-T...

http://www.mvcconf.com/videos

http://www.slideshare.net/fogus/the-macronomicon-10171952

http://www.slideshare.net/sriprasanna/introduction-to-cluste...

http://www.tele-task.de/archive/lecture/overview/5819/

http://www.tele-task.de/archive/video/flash/14029/

http://www.w3.org/DesignIssues/Principles.html

http://www.youtube.com/watch?v=4LG-RtcSYUQ

http://www.youtube.com/watch?v=4XpnKHJAok8

http://www.youtube.com/watch?v=5WXYw4J4QOU

http://www.youtube.com/watch?v=a1zDuOPkMSw

http://www.youtube.com/watch?v=aAb7hSCtvGw

http://www.youtube.com/watch?v=agw-wlHGi0E

http://www.youtube.com/watch?v=_ahvzDzKdB0

http://www.youtube.com/watch?v=at7viw2KXak

http://www.youtube.com/watch?v=bx3KuE7UjGA

http://www.youtube.com/watch?v=cidchWg74Y4

http://www.youtube.com/watch?v=EjaGktVQdNg

http://www.youtube.com/watch?v=et8xNAc2ic8

http://www.youtube.com/watch?v=hQVTIJBZook

http://www.youtube.com/watch?v=HxaD_trXwRE

http://www.youtube.com/watch?v=j3mhkYbznBk

http://www.youtube.com/watch?v=KTJs-0EInW8

http://www.youtube.com/watch?v=kXEgk1Hdze0

http://www.youtube.com/watch?v=M7kEpw1tn50

http://www.youtube.com/watch?v=mOZqRJzE8xg

http://www.youtube.com/watch?v=neI_Pj558CY

http://www.youtube.com/watch?v=nG66hIhUdEU

http://www.youtube.com/watch?v=NGFhc8R_uO4

http://www.youtube.com/watch?v=Nii1n8PYLrc

http://www.youtube.com/watch?v=NP9AIUT9nos

http://www.youtube.com/watch?v=OB-bdWKwXsU&playnext=...

http://www.youtube.com/watch?v=oCZMoY3q2uM

http://www.youtube.com/watch?v=oKg1hTOQXoY

http://www.youtube.com/watch?v=Own-89vxYF8

http://www.youtube.com/watch?v=PUv66718DII

http://www.youtube.com/watch?v=qlzM3zcd-lk

http://www.youtube.com/watch?v=tx082gDwGcM

http://www.youtube.com/watch?v=v7nfN4bOOQI

http://www.youtube.com/watch?v=Vt8jyPqsmxE

http://www.youtube.com/watch?v=vUf75_MlOnw

http://www.youtube.com/watch?v=yJDv-zdhzMY

http://www.youtube.com/watch?v=yjPBkvYh-ss

http://www.youtube.com/watch?v=YX3iRjKj7C0

http://www.youtube.com/watch?v=ZAf9HK16F-A

http://www.youtube.com/watch?v=ZDR433b0HJY

http://youtu.be/lQAV3bPOYHo

http://yuiblog.com/crockford/

ricardobeat
And here they are with titles + thumbnails:

http://bl.ocks.org/ricardobeat/raw/5343140/

waqas-
how awesome are you? thanks
Expez
Thank you so much for this!
X4
This is cool :) Btw. the first link was somehow (re)moved. The blip.tv link is now: http://www.youtube.com/watch?v=0JXhJyTo5V8
Cliff Click: A Crash Course in Modern Hardware is high up for me http://www.infoq.com/presentations/click-crash-course-modern...
Jan 15, 2012 · 4 points, 2 comments · submitted by dantheman
pron
An oldie but a goodie.
dantheman
This is a great presentation that goes over modern hardware. It's primarily about cache misses and their impact on performance. Below are some notes on the presentation (time - note).

14:30 - cache hits take 2-3 clocks; a miss to memory takes 200-300 clocks - roughly a 100x cost

15:20 - in multicore you hit L3 because of bandwidth; 1 ft of wire is about one 1 GHz clock

18:00 - shadow processing; kind of how the Cray does it

25:30 - out-of-order execution & cache misses

30:00 - results - 7 ops out of 300 due to cache misses

33:00 - miss rates are low, but a tiny (5%) miss rate dominates performance

52:20 - cache misses are hard to detect; they just look like a busy CPU, so top doesn't help...
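
(As a quick sanity check of the 33:00 note, assuming the ~3-cycle hit and ~300-cycle miss figures from 14:30: the average access cost at a 5% miss rate is roughly 0.95 × 3 + 0.05 × 300 ≈ 18 cycles, so the rare misses account for over 80% of the memory-access time.)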

Aug 12, 2011 · 1 point, 0 comments · submitted by ahalan
Jan 13, 2010 · 8 points, 0 comments · submitted by lucifer
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.