Hacker News Comments on "A Crash Course in Modern Hardware"
Cliff Click · InfoQ · 145 HN points · 10 HN comments
Hacker News Stories and Comments
All the comments and stories posted to Hacker News that reference this video.
⬐ alblue: You might also like my recent presentation on understanding CPU microarchitecture: https://speakerdeck.com/alblue/understanding-cpu-microarchit...
The presentation was recorded and is on YouTube
⬐ dang: A couple past small threads:
A Crash Course in Modern Hardware - https://news.ycombinator.com/item?id=3467493 - Jan 2012 (2 comments)
A Crash Course in Modern Hardware - https://news.ycombinator.com/item?id=1394966 - June 2010 (9 comments)
⬐ mdaniel:
> Speed of Light
> Takes more than a clock cycle for signal to propagate across a complex CPU
Wowers, I had never considered that
⬐ amelius: Well, that has little to do with the speed of light, though; it has more to do with delays due to parasitic capacitances and the slow motion of the charges that form the conducting channels inside transistors.
⬐ dreamcompiler: Light travels 1 foot per nanosecond in a vacuum. (Electricity in wires is slower.) A 1 GHz processor has a 1 ns cycle time. So yeah, with multi-GHz clocks the speed of light certainly does matter a lot, and it's one reason why keeping everything on the same chip (whenever possible) is important.
Parasitic capacitance, inductance, and carrier transport speed are important too, but it's not correct to state "it has little to do with the speed of light."
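dreamcompiler's rule of thumb is easy to check with a few lines of arithmetic; a minimal sketch (the clock rates are just illustrative values):

```python
# Back-of-the-envelope check of "1 foot per nanosecond".
C = 299_792_458  # speed of light in a vacuum, m/s

def metres_per_cycle(clock_hz):
    """How far light travels during one clock period."""
    return C / clock_hz

d_1ghz = metres_per_cycle(1e9)   # ~0.30 m, i.e. roughly one foot
d_3ghz = metres_per_cycle(3e9)   # ~0.10 m; signals in wires are slower still
print(round(d_1ghz, 2), round(d_3ghz, 2))  # → 0.3 0.1
```

At 3 GHz even light in a vacuum covers only about 10 cm per cycle, which is why a signal cannot cross a large die in a single clock.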
⬐ zekrioca: 1 foot per ns.. that really confuses me.. Why not something simpler such as 300000 km/s? :)
Lesson 1: it's almost all locked down by the vendor. Even the development tools.
⬐ zqfm: Cool! Anyone have resources on what has changed since then?
⬐ alcover: Fair warning: video of really poor quality. Blurry, white-washed presentation screen unreadable, camera follows presenter instead of focusing on screen.
⬐ stjohnswarts: To be more fair, there is a pretty crisp PowerPoint on the right that is synced to the video and enhances the presentation quite a bit.
⬐ aklemm: Can someone provide context for this?
⬐ rjsw: The presenters have done interesting stuff with Java on modern hardware; I would expect it to be good.
⬐ eternalban: Multi-core realities required teaching developers about the abstracted-away hardware. This talk is a continuation of the surfacing of hardware reality at the language and library levels.
⬐ jcranmer: This appears to cover, at a high level, roughly a "Computer Architecture 201" course: explaining pipelining and cache coherency, with discussion of the why/what (but not the how) of speculative execution, out-of-order execution, and branch prediction. If you have taken such a course before, this will likely be nothing new to you; if you haven't, it may be interesting.
⬐ hvs: Or if you took that class back in the mid-90's it might be interesting. ;)
⬐ commandlinefan: That was my thought - either this isn't a crash course in "modern" hardware, or hardware hasn't progressed much in the last 20 years.
⬐ jcranmer: At the high level that this is presenting, there really hasn't been any progress in computer architecture. That's not to say that there haven't been any improvements going on, but the improvements are more like "branch prediction is X% better" [1] or "we can issue an additional instruction per cycle", which don't have a major impact on the overall story presented here. Nor have any alternative architectures really demonstrated themselves to be competitive.
GPGPU programming has become a lot more salient, but GPGPU itself is largely the standard CPU programming model with speculative execution logic tuned way down and SMT and SIMD tuned way up (both of which would have been facets of modern hardware even at the time of this presentation). FPGAs have been "the next big thing" for, gosh, 30 years now, but they've remained relegated to niche roles.
[1] Indirect branch prediction in particular has progressed a lot even in the past decade.
They do. Video here from 7 years ago that talks about it: https://www.infoq.com/presentations/click-crash-course-moder...
Basically, they do speculative execution with register renaming to get quick turn-around if the memory is available in cache.
It really is quite crazy how much faster the CPU is than memory, and what tricks it pulls to get around that problem.
Anyone have the presentation from an Intel guy on how the CPU design focus has moved from cycles to cache misses handy? Edit: never mind, it was not an Intel guy. And I actually had the thing bookmarked (and it still worked).
https://www.infoq.com/presentations/click-crash-course-moder...
⬐ vvanders: That's a great talk, brings together a lot of different things I've seen in one place.
⬐ vcarl: There's a fantastic, massively upvoted StackOverflow post that can also provide some insight here. This may be a little more accessible, since it's such a significant runtime difference with very simple source code. http://stackoverflow.com/questions/11227809/why-is-it-faster...
⬐ woliveirajr: The question is very interesting and well phrased, and the answer is better than many classes that many students have had about processors and so on.
⬐ globuous: Thanks so much for sharing this, great read! For those, like me, who want to play with what this StackOverflow post talks about, here's a fiddle of it: https://jsfiddle.net/tbinetruy/Latkmk2q/1/ (the code takes 2s to run and loads Firebug for console logs).
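The linked StackOverflow experiment (summing only the elements above a threshold, on sorted vs. unsorted data) can be reproduced in miniature; a sketch in Python. Note that CPython's interpreter overhead hides most of the hardware branch-prediction effect, so this illustrates the setup rather than the dramatic speedups shown by the compiled versions in the post:

```python
import random
import timeit

random.seed(0)
data = [random.randrange(256) for _ in range(50_000)]

def conditional_sum(values):
    # The test `v >= 128` is the branch the CPU's predictor must guess;
    # on sorted input the guesses are almost always right.
    total = 0
    for v in values:
        if v >= 128:
            total += v
    return total

sorted_data = sorted(data)

# Sorting changes only the branch pattern, never the answer.
assert conditional_sum(data) == conditional_sum(sorted_data)

unsorted_t = timeit.timeit(lambda: conditional_sum(data), number=10)
sorted_t = timeit.timeit(lambda: conditional_sum(sorted_data), number=10)
print(unsorted_t, sorted_t)
```

In C++ or Java the sorted version runs several times faster on the same data, purely because the mispredicted branches disappear.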
For the most part, manufacturing advances drive microarchitecture advances. Smaller feature sizes mean more transistors can be stuffed in the same area. Those transistors can be used to make larger reorder buffers, more registers, more caches, better branch predictors, and more functional units. If you want to know about specific architectural changes in x86 over the years, I strongly recommend Cliff Click's talk: A Crash Course in Modern Hardware.[1]A lot of the specifics of semiconductor manufacturing are closely-guarded secrets, but Todd Fernandez gave a glimpse in an informal talk titled Inseparable From Magic: Manufacturing Modern Computer Chips.[2]
1. http://www.infoq.com/presentations/click-crash-course-modern... (starts about 4 minutes in)
I think the opposite is true. Modern hardware is so complex[1], made even more so by its constant interaction with a complex OS, that any sense of familiarity with the actual performance model is illusory, unless you're doing something very controlled and very specific (like, say, DSP). Modern hardware itself is an abstraction, hiding its operation away from you. We can no longer hope to tame hardware with meticulous control over instructions as we were able to up until the nineties.
Forget about clever compilers; forget even about smart JITs; even if you look at as big an abstraction as GCs and only consider large pauses (say, anything over a few tens of milliseconds), it is now the case that in a well-tuned application using a good GC, most large pauses aren't even due to GC, but to the OS stopping your program to perform some bookkeeping. Careful control over the instruction stream doesn't even let you avoid 100ms pauses, let alone control nanosecond-level effects.
[1]: http://www.infoq.com/presentations/click-crash-course-modern...
⬐ dly: And yet I consistently find that checksum tools, compression libraries, and things like video decoders (such as H264 decoders) written in assembly outperform all other implementations I've had to deal with. "Sufficiently smart compiler" is a tired meme at this point. Few programs benefit from being written entirely in assembly, but quite a lot benefit from having parts of them hand-optimized. Some, like game emulators - particularly one-man jobs like No$GBA - are still fully written in assembly, and their performance is a sight to behold. No$GBA would lose a lot if it were rewritten in a high-level language.
⬐ pron:
> and things like video decoders
That's precisely the example I gave. Although many modern decoders use GPUs, which are much simpler than CPUs (simpler even than 90s-era CPUs). The GPU performance model is very simple to comprehend.
> No$GBA would lose a lot if it were rewritten into a high level language.
That's a nice sentiment, but I don't think it is supported by the facts. You could probably write a JIT in Python that would perform much, much better (but that would be overkill, given that you're emulating a very slow, very small machine), and a trivial implementation in Java would probably perform just as well.
The ability to achieve significantly better performance for general-purpose tasks (let's call that "branchy code") with low-level languages today is more myth than reality. What is true that some high-level languages consciously give up on some performance to make development easier, but that's a design choice. That's not to say that optimizing JIT and AOT compilers get everything right -- they don't -- but they get it right often enough that they're very hard to beat.
⬐ jamii: Most of us don't have the time for meticulous control over instructions, but those who do can certainly use it to good effect, e.g. http://www.reddit.com/r/programming/comments/hkzg8/author_of...
My aversion to piles of opaque heuristics is not because I'm against smart compilers, just that for certain projects I want to form a mental model of what code I should write to get a certain effect. The trend of modern languages with heavy heuristic optimisations or complex JITs is towards less certainty and less stable optimisations, so that a program that runs fine today might be unusably slow tomorrow.
Staging and compiler-as-a-library is a promising compromise for projects that really care about stable performance eg http://data.epfl.ch/legobase . You can still have an LLVM-smart compiler underneath but you get to make the first pass.
Rust is actually very predictable in some respects eg generic functions will be monomorphised. I prefer it to wrangling GHC or the V8 JIT.
⬐ pron:
> I want to form a mental model of what code I should write to get a certain effect
And how do you do that with hyperthreading, virtual memory, power management that may decide to power down your core because what you're doing doesn't seem important enough (and that differs greatly from one processor to another), and cache effects on code, data and TLB (all strongly affected by other threads and processes running on your machine[1])?
While those effects didn't exist much before the 90s, and they don't exist today in GPUs and small embedded devices, on desktops and servers those effects may be much greater in magnitude than any difference you're able to get by better control over generated code. Not running a hypervisor, turning off virtual memory, pinning threads and isolating cores have a much more profound effect on predictability than which language or compiler you're using. Focusing on compilation before taking care of those much more powerful sources of unpredictability is like trying to get a faster car by reducing the weight of the upholstery fabric.
> so that a program that runs fine today might be unusably slow tomorrow.
I think that slowdown actually applies to assembly programs much more than to, say, Java. As CPU architecture changes, it's actually easier to keep higher-level code performant. I mean, why do you assume that compiler changes will hurt your code performance more than CPU changes?
> You can still have an LLVM-smart compiler underneath but you get to make the first pass.
There are many ways to produce good machine code (my favorite is Graal, HotSpot's next-gen JIT), but none of them really give you a good mental model of what's going on. You may like one approach over another for personal aesthetic reasons, one approach may actually produce better results for some workloads than others, and some approaches really are more predictable -- but no approach produces categorically predictable results, and more predictability doesn't buy you better performance (though it still requires more effort).
It used to be that if you knew what instructions your compiler would emit, you knew how your program would perform. That is just no longer the case (well, it is to some degree, but other effects are stronger). A single instruction may perform anywhere within 7 orders of magnitude (from an L1 cache hit to a virtual memory miss) depending on effects outside the program's control! (Of course, those high-volatility costs are usually amortized, but so are the costs of less predictable compiler output.)
[1]: That is the key to cryptographic attacks that let a process sense what a cryptography algorithm running in another process is doing by the way the cryptographic computation affects the performance of the first process.
⬐ jamii: I think you are taking a very black-and-white point of view. Yes, hardware is complex and unpredictable. That doesn't mean that we can't reason at all about performance. I take a program and measure its performance on a wide range of real-world workloads across multiple different machines. Then I change some numeric routine to use unboxed integers instead of boxed integers. I measure it again on a wide range of real-world workloads across multiple different machines and find that it is significantly faster in all cases. My approximate mental model of how the machine works allowed me to make a change that empirically improved performance. My model is not perfect, so I do have to measure carefully, but it is what allows me to make sensible decisions about which changes to measure rather than just changing things at random.
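The boxed-vs-unboxed distinction jamii describes can be made concrete on the memory side; a rough sketch in Python, where every `int` in a list is a heap-allocated box behind a pointer, while the stdlib `array` module stores raw 8-byte machine words (the sizes are implementation details of CPython, so treat the exact numbers as illustrative):

```python
import array
import sys

N = 1000
boxed = list(range(N))             # array of pointers to boxed int objects
unboxed = array.array('q', boxed)  # one contiguous buffer of 8-byte ints

# The boxed version pays for the pointer array *and* one object per int.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(v) for v in boxed)
unboxed_bytes = sys.getsizeof(unboxed)

print(boxed_bytes > unboxed_bytes)  # the boxed layout is several times larger
```

Fewer, smaller, contiguous objects means fewer cache misses and no pointer chasing, which is where the speedups jamii measured come from.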
In a language where the compiler controls unboxing, my mental model is much more approximate. I have to figure out how to influence the heuristics to lead them into making the correct choice, and the solutions tend to be hacks that are highly sensitive to small changes to the heuristics, leading to conversations like https://groups.google.com/forum/#!topic/clojure/GvNLOrN3lGA .
Performance for non-tuned code may be better on average, but my ability to tune important areas is reduced. If the compiler were more predictable, or had an interface that allowed me to add information, or if I could make my own passes, then that trade-off would go away. I'm not against smart compilers; I'm against smart compilers that don't talk to me.
⬐ pron:
> I'm not against smart compilers, I'm against smart compilers that don't talk to me.
There are some extremely interesting advances in that area in OpenJDK. Java 9 will contain two relevant changes. The first, JEP 165[1] (fine-grained and method-context-dependent control of the JVM compilers), lets you control compilation with metadata depending on context (e.g. inline method foo when called from bar); a much more interesting and powerful enhancement targeted for Java 9 is JEP 243[2] (Java-Level JVM Compiler Interface). It will do the following:
* Allow the JVM to load Java plug-in code to examine and intercept JVM JIT activity.
* Record events related to compilation, including counter overflow, compilation requests, speculation failure, and deoptimization.
* Allow queries to relevant metadata, including loaded classes, method definitions, profile data, dependencies (speculative assertions), and compiled code cache.
* Allow an external module to capture compilation requests and produce code to be used for compiled methods.
This opens the door to what I think is the most impressive compiler of the last decade, and a true breakthrough in (JIT) compiler design: Graal[3]. Graal supports languages of any level (it already has frontends for Java, C, Ruby, Python, R and JavaScript), and then allows complete control over code-generation and optimization decisions at runtime. E.g. you tell it what kinds of speculations to make, and it tells you which speculations failed. Unlike with LLVM, you compile your language into a semantic AST (that may or may not match the language's AST) and feed it to Graal, but each node may contain not just semantics but instructions on speculation and code-gen control at any level you wish. During compilation, Graal interacts with the node and the node gives further instructions. As I understand it, JEP 243 will make it possible to plug Graal into the standard OpenJDK HotSpot (though at reduced speed), until Graal matures enough to become HotSpot's default compiler.
So what Graal will do is let the developer (if the language designer allows), write simple, high-level code, but tell the compiler, "listen, compile however you like, but when you get to this function, talk to me because I have some ideas on how to compile it just right".
[1]: http://openjdk.java.net/jeps/165
[2]: http://openjdk.java.net/jeps/243
[3]: https://wiki.openjdk.java.net/display/Graal/Publications+and...
⬐ jamii: Thanks, that is really interesting. I'll have to look into it.
Those of you wanting to know more about this may be interested in Cliff Click's Crash Course in Modern Hardware.[1] It does a pretty good job of explaining how pipelined, superscalar, OoO CPUs came to be.
1. http://www.infoq.com/presentations/click-crash-course-modern...
No argument there. BTW, for anyone interested in an overview of the non-determinism underlying modern hardware architectures, I recommend watching this great talk[1] - A Crash Course in Modern Hardware - by Cliff Click, one of the world's top JIT experts.
[1]: http://www.infoq.com/presentations/click-crash-course-modern...
For an in-depth presentation on how we got to this point (cache misses dominating performance), there's an informative and interesting talk by Cliff Click called A Crash Course in Modern Hardware: http://www.infoq.com/presentations/click-crash-course-modern...
The talk starts just after 4 minutes in.
> In short, OOO cores are weird and horribly complicated and completely untrustworthy where performance is concerned.Yep. There's a great talk about this by Cliff Click, called A Crash Course in Modern Hardware[1] that I would recommend to everyone. I am no hardware expert, so that talk really enlightened me.
Regarding the issue at hand, I remember Doug Lea saying[2] that some new Intel processors may recognize a loop as the OS's idle loop and power down the core. That's why he computes random numbers in busy-wait loops.
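The trick Doug Lea describes can be sketched in a few lines; a hypothetical illustration (the xorshift constants and the deadline check are my own, not Lea's code): the point is that the spin loop does real, data-dependent arithmetic, so it doesn't look like an idle loop to the hardware.

```python
import time

def busy_wait(seconds):
    """Spin until a deadline while doing throwaway xorshift arithmetic,
    so the loop does real work instead of looking idle."""
    deadline = time.monotonic() + seconds
    x = 0x9E3779B9  # arbitrary nonzero seed
    while time.monotonic() < deadline:
        # One 32-bit xorshift step: cheap pseudo-random work.
        x ^= (x << 13) & 0xFFFFFFFF
        x ^= x >> 17
        x ^= (x << 5) & 0xFFFFFFFF
    return x  # return the value so the work can't be treated as dead code

r = busy_wait(0.01)
assert 0 <= r <= 0xFFFFFFFF
```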
[1] http://www.infoq.com/presentations/click-crash-course-modern...
[2] http://emergingtech.chariotsolutions.com/2013/04/phillyete-s...
Summary of the links shared here:
http://blip.tv/clojure/michael-fogus-the-macronomicon-597023...
http://blog.fogus.me/2011/11/15/the-macronomicon-slides/
http://boingboing.net/2011/12/28/linguistics-turing-complete...
http://businessofsoftware.org/2010/06/don-norman-at-business...
http://channel9.msdn.com/Events/GoingNative/GoingNative-2012...
http://channel9.msdn.com/Shows/Going+Deep/Expert-to-Expert-R...
http://en.wikipedia.org/wiki/Leonard_Susskind
http://en.wikipedia.org/wiki/Sketchpad
http://en.wikipedia.org/wiki/The_Mother_of_All_Demos
http://io9.com/watch-a-series-of-seven-brilliant-lectures-by...
https://github.com/PharkMillups/killer-talks
http://skillsmatter.com/podcast/java-jee/radical-simplicity/...
http://stufftohelpyouout.blogspot.com/2009/07/great-talk-on-...
https://www.destroyallsoftware.com/talks/wat
https://www.youtube.com/watch?v=0JXhJyTo5V8
https://www.youtube.com/watch?v=0SARbwvhupQ
https://www.youtube.com/watch?v=3kEfedtQVOY
https://www.youtube.com/watch?v=bx3KuE7UjGA
https://www.youtube.com/watch?v=EGeN2IC7N0Q
https://www.youtube.com/watch?v=o9pEzgHorH0
https://www.youtube.com/watch?v=oKg1hTOQXoY
https://www.youtube.com/watch?v=RlkCdM_f3p4
https://www.youtube.com/watch?v=TgmA48fILq8
https://www.youtube.com/watch?v=yL_-1d9OSdk
https://www.youtube.com/watch?v=ZTC_RxWN_xo
http://vpri.org/html/writings.php
http://www.confreaks.com/videos/1071-cascadiaruby2012-therap...
http://www.confreaks.com/videos/759-rubymidwest2011-keynote-...
http://www.dailymotion.com/video/xf88b5_jean-pierre-serre-wr...
http://www.infoq.com/presentations/Are-We-There-Yet-Rich-Hic...
http://www.infoq.com/presentations/click-crash-course-modern...
http://www.infoq.com/presentations/miniKanren
http://www.infoq.com/presentations/Simple-Made-Easy
http://www.infoq.com/presentations/Thinking-Parallel-Program...
http://www.infoq.com/presentations/Value-Identity-State-Rich...
http://www.infoq.com/presentations/We-Really-Dont-Know-How-T...
http://www.slideshare.net/fogus/the-macronomicon-10171952
http://www.slideshare.net/sriprasanna/introduction-to-cluste...
http://www.tele-task.de/archive/lecture/overview/5819/
http://www.tele-task.de/archive/video/flash/14029/
http://www.w3.org/DesignIssues/Principles.html
http://www.youtube.com/watch?v=4LG-RtcSYUQ
http://www.youtube.com/watch?v=4XpnKHJAok8
http://www.youtube.com/watch?v=5WXYw4J4QOU
http://www.youtube.com/watch?v=a1zDuOPkMSw
http://www.youtube.com/watch?v=aAb7hSCtvGw
http://www.youtube.com/watch?v=agw-wlHGi0E
http://www.youtube.com/watch?v=_ahvzDzKdB0
http://www.youtube.com/watch?v=at7viw2KXak
http://www.youtube.com/watch?v=bx3KuE7UjGA
http://www.youtube.com/watch?v=cidchWg74Y4
http://www.youtube.com/watch?v=EjaGktVQdNg
http://www.youtube.com/watch?v=et8xNAc2ic8
http://www.youtube.com/watch?v=hQVTIJBZook
http://www.youtube.com/watch?v=HxaD_trXwRE
http://www.youtube.com/watch?v=j3mhkYbznBk
http://www.youtube.com/watch?v=KTJs-0EInW8
http://www.youtube.com/watch?v=kXEgk1Hdze0
http://www.youtube.com/watch?v=M7kEpw1tn50
http://www.youtube.com/watch?v=mOZqRJzE8xg
http://www.youtube.com/watch?v=neI_Pj558CY
http://www.youtube.com/watch?v=nG66hIhUdEU
http://www.youtube.com/watch?v=NGFhc8R_uO4
http://www.youtube.com/watch?v=Nii1n8PYLrc
http://www.youtube.com/watch?v=NP9AIUT9nos
http://www.youtube.com/watch?v=OB-bdWKwXsU&playnext=...
http://www.youtube.com/watch?v=oCZMoY3q2uM
http://www.youtube.com/watch?v=oKg1hTOQXoY
http://www.youtube.com/watch?v=Own-89vxYF8
http://www.youtube.com/watch?v=PUv66718DII
http://www.youtube.com/watch?v=qlzM3zcd-lk
http://www.youtube.com/watch?v=tx082gDwGcM
http://www.youtube.com/watch?v=v7nfN4bOOQI
http://www.youtube.com/watch?v=Vt8jyPqsmxE
http://www.youtube.com/watch?v=vUf75_MlOnw
http://www.youtube.com/watch?v=yJDv-zdhzMY
http://www.youtube.com/watch?v=yjPBkvYh-ss
http://www.youtube.com/watch?v=YX3iRjKj7C0
http://www.youtube.com/watch?v=ZAf9HK16F-A
⬐ ricardobeat: And here they are with titles + thumbnails:
⬐ waqas-: how awesome are you? thanks
⬐ Expez: Thank you so much for this!
⬐ X4: This is cool :) Btw. the first link was somehow (re)moved. The blip.tv link is now: http://www.youtube.com/watch?v=0JXhJyTo5V8
Cliff Click: A Crash Course in Modern Hardware is high up for me http://www.infoq.com/presentations/click-crash-course-modern...
⬐ pron: An oldie but a goodie.
⬐ dantheman: This is a great presentation that goes over modern hardware. It's primarily about cache misses and their impact on performance. Below are some notes on the presentation (time - note).
14:30 - a cache hit takes 2/3 clocks; a miss to memory takes 200/300 clocks - a 100x cost
15:20 - in multicore you hit L3 because of bandwidth; 1 ft of wire is one clock at 1 GHz
18:00 - shadow processing; kind of how the Cray does it
25:30 - out-of-order execution & cache misses
30:00 - results - 7 ops out of 300 due to cache misses
33:00 - miss rates are low, but a tiny (5%) miss rate dominates performance
52:20 - cache misses are hard to detect; they just look like a busy CPU - top doesn't help...
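The cache-miss theme in the notes above can be felt even from a high-level language: traversal order over the same data changes locality and nothing else. A sketch in Python (CPython's interpreter overhead mutes the effect compared to compiled code, so no timing is promised here; the structure is what matters):

```python
import timeit

N = 300
# N x N grid stored as nested lists, row-major: grid[row][col].
grid = [[row * N + col for col in range(N)] for row in range(N)]

def sum_row_major(g):
    # Visits elements in storage order: good locality.
    return sum(v for row in g for v in row)

def sum_col_major(g):
    # Hops between rows on every access: poor locality.
    return sum(g[row][col] for col in range(N) for row in range(N))

# Traversal order changes only locality, never the result.
assert sum_row_major(grid) == sum_col_major(grid)

row_t = timeit.timeit(lambda: sum_row_major(grid), number=5)
col_t = timeit.timeit(lambda: sum_col_major(grid), number=5)
print(row_t, col_t)
```

In C over a flat array the column-major walk can be several times slower, for exactly the reason the talk gives: the misses just look like a busy CPU.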