Hacker News Comments on CppCon 2017: Chandler Carruth "Going Nowhere Faster"
CppCon · YouTube · 3 HN points · 4 HN comments
Hacker News Stories and Comments
All the comments and stories posted to Hacker News that reference this video.

I happened to watch a talk by Chandler yesterday that might be the one the author was referring to?
Chandler Carruth has a good talk with an example of why speculative execution is critical for performance. https://www.youtube.com/watch?v=2EWejmkKlxs It starts around 36m13s time frame.
⬐ dnautics: Conceivably this could all be put into the compiler.
⬐ ken: You mean speculative execution? Do you know of a sample implementation?
⬐ twtw: Conceivable, yes. Practical? Not so much. This is an idea that I personally love, but that hasn't fared well so far. Compilers are not as good at assigning instruction schedules statically as hardware is at doing it dynamically.
⬐ dnautics: Curious as to why hardware can do it dynamically while software can't. It's all logic in the end. I can understand "not being able to statically compile it because every architecture is different," but presuming our compiler compiled to a specific platform, why wouldn't it be able to dynamically rearrange in, say, a JITted fashion, using exactly the logic that's available in the hardware?
⬐ twtw: Hopefully I'll put together a more technical answer in a while, but for now I'll just point out that when talking about performance, reducing things to "it's all logic in the end" makes little sense. We could emulate a modern CPU on an 8-bit microcontroller, but the performance would be bad.

That's a fascinating video about how modern processors work, but I don't see here why it's critical for performance. If you built a CPU without speculation, how bad would perf be? What other features could you still use? How much do common algorithms depend on speculation?

⬐ pkaye: Superscalar processors have a deep pipeline with many execution units and keep a lot of instructions in flight, so the penalty of a misprediction or stall is significant. Every time the processor reaches a branch instruction that depends on a result which is not yet available, it must either speculate or stall. Most programs consist of small stretches of compute code followed by a branch that may depend on the results of that code.
> The YouTube interface says "auto-generated"

Ok, I never noticed that. I just read the captions and assumed obvious misspellings were auto-generated.
For example, in this video[1], the caption text is "L1D cache misses" but he's actually saying "L1-dcache misses". (The Linux terminal screen he's showing does display "L1-dcache".) Even though that video is not labeled as "auto-generated", I assumed it was because of the bad caption. Based on your info, I guess CppCon uses humans, Mechanical Turk workers or other non-domain typists, to manually add the captions.
⬐ yorwba: Manual captioning is almost never done by domain experts, but by people who have some training with a captioning system and work as professional captionists. Their main advantage is that they'll caption much faster and much cheaper than having domain experts do it, but the quality tends to suffer.

In college, I met a deaf guy who always had two women accompany him to lectures; one of them would repeat everything into a mouth-covering microphone to generate an automatic transcription, and the other went over it to correct obvious errors. They generated a lot of nonsense, especially when the German professor was using English loanwords for CS concepts. I was always amazed that the deaf guy still somehow managed to learn something from these garbled transcriptions.
⬐ Bromskloss: Why not sign language?
⬐ yorwba: I guess sign language interpreters are more expensive.
Disclaimer: I am not an expert and have not measured; this is armchair theory. But I would argue two things.

First, the former appears to have at least one unaligned arithmetic instruction:
> 400538: mov 0x200b01(%rip),%rdx # 601040 <counter>
...while the latter's equivalent instruction is 4-byte aligned:
> 40057d: mov 0x200abc(%rip),%rdx # 601040 <counter>
So, I would argue that's the biggest source of _speedup_ in the second case. However, I'm really interested in whether that's true, since I don't see a memory fence, so the memory should be in L0 cache for both cases; I have trouble believing that an unaligned access can be so much slower with the data in cache.
As for the `callq` to `repz retq`, I would venture a guess that the CPU's able to identify that there are no data dependencies there and the data's never even stored; I'd argue that it probably never even gets executed because the instruction should fit in instruction cache and branch prediction cache and all. Arguably. Like I said, I'm not an expert.
I'd say run it through Intel's code analyzer tool.
https://software.intel.com/en-us/articles/intel-architecture...
Tangential video worth watching:
https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be...
Edit: actually, thinking about it, it's not unaligned access, it's unaligned math. I don't think that should affect performance at all? Fun.
⬐ nkurz: I'm sorry, but like the other comment at the bottom, your guesses are so far from reality that they are hard to respond to. IACA is great for what it does, but it's a static analyzer and knows nothing about alignment. L0 doesn't even exist on modern Intel processors. Memory fences would change things, but aren't part of the problem as stated. And your guess that "it probably never even gets executed because the instruction should fit in instruction cache and branch prediction cache and all" just doesn't have any bearing on the way processors work.

Your disclaimer does indicate that you have the self-awareness that you are not an expert, but the fact that you are trying to make an argument would normally indicate that you think you understand what's happening to some extent. Rather than just guessing, I think you'd benefit from trying some things out and seeing what the results are. Play with perf, it's fun!