HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
"Mill vs. Spectre: Performance and Security" by Ivan Godard

Strange Loop Conference · YouTube · 80 HN points · 0 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention Strange Loop Conference's video "Mill vs. Spectre: Performance and Security" by Ivan Godard.
YouTube Summary
The Meltdown and Spectre attacks, disclosed last year, have upended the industry. With them an attacker can read any location in memory and extract the secret content at high rates. The attacks are unique because they gain access, not by exploiting some bug in application or kernel code, but through a fundamental architecture design flaw in most modern commercial CPUs. Working around the flaw reliably can cost a third or more of program performance.

The keyword above is "most". General purpose CPUs today commonly use Out of Order (OOO) scheduling and speculative execution to obtain higher performance. Unfortunately, Spectre and Meltdown have revealed that the increase in speed provided by OOO comes with an inherent cost: total loss of security. However, not all CPUs use the OOO architecture. Many low-end architectures that are more concerned with power usage than speed use an older approach, In-Order (IO), and eschew speculation. Such chips are inherently immune to Meltdown/Spectre. In fact, the microcode workarounds applied to OOO machines to prevent these attacks in effect convert them into IO machines that run at In-Order speed while using OOO power to do it.

There is an exception to this gloomy news. The Mill architecture was designed from the beginning to provide OOO performance on an IO power budget. It does no hardware speculation and so, serendipitously, is immune to Meltdown and Spectre. That's the easy part - a Z80 does no hardware speculation and is immune too. The hard part is getting the performance of speculation without opening security holes. The talk will explain the security problem, show why the Mill is immune, and will lightly address why Mill performance does not need OOO.

Speaker: Ivan Godard
HN Theater Rankings

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Oct 19, 2018 · 71 points, 31 comments · submitted by leoc
evancox100
(Started watching at 24:30, thanks y'all).

Everything he's saying misses the mark. If the only issue was hiding the memory latency of a load when you know the address, you could solve this with existing techniques like simultaneous multithreading (a la HyperThreading), prefetch hints, etc.

The need for speculation arises when you do not know ahead of time which address to access, which branch to take, etc. For example, you're accessing an element in an array and need to multiply the index by the element size. You don't know which address to load until the multiply completes, so you speculate. I don't see how the Mill's deferred load semantics help you any more than a prefetch or dummy load would. Actually, unless I'm missing something you couldn't even use the deferred load because, again, you don't have the address.
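
To make the dependency concrete, here is a minimal C sketch (illustrative only, not from the talk) of a load whose address depends on a prior computation; no load instruction, deferred or otherwise, can issue before the index arithmetic completes, and that gap is exactly what speculation fills:

    #include <stddef.h>

    /* The load's address depends on the computed offset, so the load
       cannot begin until the multiply finishes. */
    long read_element(const char *base, size_t index, size_t elem_size) {
        size_t offset = index * elem_size;      /* address unknown until here */
        return *(const long *)(base + offset);  /* dependent load: wait or speculate */
    }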

snuxoll
> The need for speculation arises when you do not know ahead of time which address to access, which branch to take, etc. For example, you're accessing an element in an array and need to multiply the index by the element size. You don't know which address to load until the multiply completes, so you speculate. I don't see how the Mill's deferred load semantics help you any more than a prefetch or dummy load would. Actually, unless I'm missing something you couldn't even use the deferred load because, again, you don't have the address.

You kind of hit three different issues here. There are three completely different scenarios I can think of off the top of my head to cover, and the Mill design ties with out-of-order designs in the worst case and beats them in the other two.

1. Random I/O on array elements - nobody wins here, because branch prediction and speculative loads will consistently fail; you hope your data is in cache, and everybody stalls if not.

2. Sequential I/O on array elements - the Mill can perform equally to an out-of-order design in most cases and beat it in others: you don't rely on the CPU seeing far enough ahead to reorder loads, and you have much better facilities for parallelizing common operations (their strstr example using their smear instruction, NaR values, and pervasive vectors is truly mindblowing). See the sketch just after this comment.

3. Switch statements with jump tables - the Mill's wide-issue design handles many of these cases without needing jump tables to begin with, especially when paired with speculative operations on potential NaRs. When you need to call code at another address you are again at the mercy of the branch predictor and instruction prefetch, which the Mill does have, with some novel designs that provide a low mispredict penalty and purportedly better prediction results. Ultimately, though, if you keep hitting mispredicts you're in the same worst case as on out-of-order designs.

The Mill can't beat out-of-order designs where your code just thrashes the cache, causes mispredicts all the time, and so on, but it can match them without eating gobs of power.
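
To illustrate scenario 2: here is a rough C sketch (my own illustration, not Mill code) of the effect a deferred load gives you in hardware. Each element's load is started one iteration before its value is used, so the memory latency overlaps independent work instead of stalling:

    #include <stddef.h>

    /* Software-pipelined sum: issue each load one iteration early so its
       latency hides behind the add, with no out-of-order hardware needed. */
    long sum_pipelined(const long *a, size_t n) {
        if (n == 0) return 0;
        long sum = 0;
        long next = a[0];              /* start the first load early */
        for (size_t i = 1; i < n; i++) {
            long cur = next;           /* value whose load began last iteration */
            next = a[i];               /* begin the next load now */
            sum += cur;                /* independent work during the load */
        }
        return sum + next;             /* consume the final outstanding load */
    }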

twtw
You talk about the Mill as if it exists. There is no hardware, there are no benchmarks. Bloviating about the excellent performance of the Mill is not valuable - showing SPEC CPU results is. VLIW performance was great too, until it wasn't. You can statically schedule everything in theory and the performance will be great, but experience suggests that giving hardware the capability to react dynamically cannot be replaced by static scheduling, except in code with limited branching and a known execution pattern. This is why VLIW works nicely for DSP, and fairly poorly for general-purpose computing.

The Mill has been in development for 15 years, and has been almost done for 5. Forgive me for not holding my breath.

youdontknowtho
It's the myth of the perfect system that is an underdog to industry giants whose products are inferior but dominant for (insert reason).

It's like: 1. Rewrite it in Rust. 2. Plan 9. 3. Functional programming for everything. 4. Lisp. 5. There are more, but you take my meaning...

wtallis
> Forgive me for not holding my breath.

You're the one who chose to click on the link to enter this discussion. You know that the Mill isn't in silicon yet and you're personally only interested in things that are, so why are you here? You're just trolling while other people are trying to have a productive academic discussion.

twtw
This "productive academic discussion" of the myriad benefits of the mill architecture has been repeated over and over again for at least 5 years, with very few new developments. There is great value to thinking about new and non traditional architectures, but discussion around this particular venture is pretty tired. I don't know that much more discussion is valuable at this point without some evidence.
wtallis
> has been repeated over and over again for at least 5 years, with very few new developments.

The same is true of discussions of Intel's architectures; they've only released one new microarchitecture in the past 5 years. Hardware development is slow, even for the people who are already shipping silicon.

sparkie
The difference is that Intel has a proven track record of producing actual products. Mill Computing has not produced anything tangible yet.

The strategy appears from the outside to be one of aggregating IP in the hope that they'll license it (like ARM) or get acquired for a large sum.

I hope I'm wrong, but I'm not expecting to be able to pick up a Mill CPU in the next 5 years. Maybe even 10.

angry_octet
It isn't a productive academic discussion if you're gatekeeping out views that don't match your own, even when they're technically informed.

If there were results from an FPGA-synthesised version of the Mill, there would be less scepticism. But as it is, the Mill is just a design, and claimed performance/features require more evidence than for an existing architecture.

deepnotderp
I don't understand why people are so willing to say "it won't work" without actually taking the time to understand it. They literally spend like every one of their talks addressing how they overcome traditional VLIW problems.
evancox100
He's not saying it won't work, he's saying it doesn't exist yet, so talking about it as if it does is a bit silly.
__s
24:30 is where the info about the Mill architecture starts; everything beforehand builds context by explaining Spectre.
analognoise
Does this thing even exist on an FPGA yet?
Veedrac
No.
jcranmer
Does anyone have a link to the slides? I find that a much preferable way to access this sort of stuff...
leoc
Presumably it will show up at https://millcomputing.com/docs/ eventually but it doesn't seem to be there yet.
gizmo686
Discussion on Mill begins at about 24:30.
ptc
This Mill guy is the gift that keeps on giving. With any luck he’ll still be around to explain how the soon-to-be-released Mill 1.0 CPU would have avoided the year 2038 problem.
gbrown_
> With any luck he’ll still be around to explain how the soon-to-be-released Mill 1.0 CPU would have avoided the year 2038 problem.

What? Software working with a 64-bit time_t is not the CPU's problem.

twtw
Talk is cheap.

The TL;DW is that the Mill CPU will have better performance than existing CPUs without speculative execution because it has "deferred loads," while the straw-man not-Mill architecture doesn't and therefore stalls after every load. Also, newsflash - Spectre doesn't impact architectures that don't speculate.

This is great, except that existing CPUs don't stall after issuing a load. Scoreboarding + prefetch are together capable of more than this "deferred load," and require less work from the compiler. If you have independent instructions following a load, any existing architecture worth its salt will notice that and execute them while the load is in progress.
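
Sketched in C for concreteness (my illustration; __builtin_prefetch is a real GCC/Clang intrinsic, the rest is a toy): the prefetch hint is issued well before the address is needed, and independent arithmetic proceeds while each load is in flight:

    #include <stddef.h>

    /* Independent work overlapping an outstanding load, plus a software
       prefetch hint for an address that is computable well in advance. */
    long walk(const long *data, const size_t *idx, size_t n) {
        long acc = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&data[idx[i + 8]]); /* warm the cache early */
            long v = data[idx[i]];  /* this load may miss... */
            acc = acc * 31 + 7;     /* ...but this arithmetic is independent
                                       and can run while the load is in flight */
            acc += v;               /* first true use of the loaded value */
        }
        return acc;
    }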

It's potentially a neat idea to include the number of cycles until load retire in the instruction, but it's a joke to pretend that it's higher performance than what x86 does and will get you back all the performance lost by not speculating.

I can't help but think that the Mill architecture gets a lot of hype from a lot of people who don't know very much about computer architecture. There have been lots of great ideas that didn't pan out for general-purpose computing, and I'm not sure that this vaporware architecture deserves to be thought about.

deepnotderp
> This is great, except that existing CPUs don't stall after issuing a load. Scoreboarding + prefetch are together capable of more than this "deferred load," and require less work from the compiler. If you have independent instructions following a load, any existing architecture worth its salt will notice that and execute them while the load is in progress.

<Citation needed>. BTW, deferred loads were rediscovered as "decoupled load" (http://people.duke.edu/~bcl15/documents/huang2016-nisc.pdf) and achieved a respectable 8.4% avg speedup.

> It's potentially a neat idea to include the number of cycles until load retire in the instruction, but it's a joke to pretend that it's higher performance than what x86 does and will get you back all the performance lost by not speculating.

That's not at all what's claimed. The entire idea is that you try to approximate OoO performance, not beat OoO performance.

> I can't help but think that the mill architecture gets a lot of hype from a lot of people that don't know very much about computer architecture. There have been lots of great ideas that didn't pan out for general purpose computing, and I'm not sure that this vaporware architecture deserves to be thought about.

I mean, it's been discussed as a valid idea by people on the P6 team (one of the first commercial OoOs), but what would they know? ¯\_(ツ)_/¯

twtw
"<Citation needed>"

This is a joke, right? Or are you actually calling me out for no evidence on a discussion of the mill architecture, which has been proclaimed for a decade as the greatest thing since sliced bread without a shred of supporting evidence?

deepnotderp
You claimed that deferred loads help less than prefetching and scoreboarding. You should substantiate that claim.

In any case, see the linked paper for why you're probably wrong on that point.

petermcneeley
"I can't help but think that the mill architecture gets a lot of hype from a lot of people that don't know very much about computer architecture. There have been lots of great ideas that didn't pan out for general purpose computing, and I'm not sure that this vaporware architecture deserves to be thought about."

This architecture is basically Itanium++, which was a very serious arch but didn't make it (just like the PS3's Cell). To ask whether this arch has any possible future, one should really ask why Itanium didn't succeed.

snuxoll
> This is great, except that existing CPUs don't stall after issuing a load. Scoreboarding + prefetch are together capable of more than this "deferred load" [...]

Except existing CPUs spend a lot of die space and power budget on speculative execution to hide the stall; the point of a deferred load is that you don't need all this hardware to extract the same performance.

> and require less work from the compiler. [...]

Three words: static single assignment. If you can work out the dataflow of a function, you already have everything you need to order loads in the most efficient way possible. This is why all of Mill Computing's work has been around LLVM: LLVM IR forces SSA by design. Hell, your compiler doesn't even need to think about the ordering if it relies on LLVM to do the native code generation, because the Mill backend is supposed to do all of this for you.
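
A toy C function shows the idea (my sketch, not Mill Computing's toolchain): once the dataflow is explicit, the earliest legal issue point of each load is visible, and independent work can be scheduled into the load's latency:

    /* In SSA form every value is defined exactly once, so the dependence
       structure is explicit: the load needs only p and i, the multiply
       needs only a and b, and a static scheduler can issue the load first
       and fill its latency with the multiply. */
    int combine(const int *p, int i, int a, int b) {
        int v = p[i];    /* address available at entry: issue the load first */
        int t = a * b;   /* independent of the load: runs during its latency */
        return v + t;    /* join point: first use of the loaded value */
    }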

> It's potentially a neat idea to include the number of cycles until load retire in the instruction, but it's a joke to pretend that it's higher performance than what x86 does and will get you back all the performance lost by not speculating.

Deferred loads alone aren't there to beat x86 in terms of performance; they're there to avoid needing all the costly out-of-order hardware while avoiding the memory stalls that previous statically scheduled/in-order machines incur. There are other features in the architecture to bring better performance, but that's all around the VLIW-like design.

jcranmer
> Three words: static single assignment. If you can work out the dataflow of a function you already have everything you need to order loads in the most efficient way possible

The ordering of loads has almost nothing to do with dataflow (with the slight caveat that data dependencies from loads guarantee a small amount of the memory order). I'm speaking from experience here: any computation DAG model is going to run very quickly into the problems of dealing with branches and the inherently undecidable problem of static alias analysis.

gpderetta
As far as I know (and I don't know much, because I'm not a compiler guy), LLVM (and most compilers) doesn't keep everything in SSA form. Any value whose address has escaped (most things not on the C stack, and even some local variables as well) must be treated as memory. I think that non-automatic, not-recently-used variables would also be the values that would benefit most from deferred loads. So IIRC the Mill has hardware to help with aliasing, but it wouldn't in fact plug out of the box into LLVM.

Do the Mill guys even have an LLVM backend yet? Or even any compiler at all?

Veedrac
I believe the latest we've heard is that their LLVM backend is mostly working but still pretty buggy.
UnquietTinkerer
For anyone interested, here are links to the slides and the accompanying white paper.

[Slides] https://millcomputing.com/blog/wp-content/uploads/2018/04/20...

[White Paper] https://millcomputing.com/blog/wp-content/uploads/2018/01/Sp...

I haven't read the paper yet; hopefully it offers more detail than the talk does because I am still confused about how the Mill avoids cache pollution from speculative loads.

EDIT: Here is my attempt at a summary of the relevant bits of the whitepaper:

The Mill is immune to Meltdown for the same reason AMD et al. are; it does permission checks before loading rather than in parallel and thus the load faults before going to memory.

The Mill is immune to Spectre because "Current Mill configurations will [speculatively] issue, and revoke, a maximum of two instructions. Revocation includes all cache and other micro-architectural side effects."

Neither of those points is covered in the talk. I don't know enough about the subject to judge, but the arguments in the paper seem a bit glib. I'd like to hear from an expert on the subject.

strstr
I'd be pretty surprised if they don't leave speculatively loaded (and still correct) data in the cache. My understanding of speculation is that this was sort of the point: often you won't compute the right value (because you have to be right in every instance), but you will have loaded nearly all of the relevant data into the cache, so it's comparatively fast the second time around.
Symmetry
I'd be very surprised if they didn't too. But Spectre isn't just about what's in the cache: you have to load secret data and then do another load at a location based on that secret data, before the mis-predicted branch is caught. The number of clock cycles from branch prediction to branch resolution on the Mill is just too short for you to do all of that, just as it is on most in-order architectures. Just loading the secret data into cache isn't enough to be a problem; you already knew its address if the attack was going to work.
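
For reference, the canonical Spectre variant-1 gadget (after Kocher et al.; the array names follow the paper's convention) shows the two dependent loads described above. With a speculation window of only a couple of cycles, the second load cannot issue before the misprediction resolves:

    #include <stddef.h>
    #include <stdint.h>

    uint8_t array1[16];
    size_t  array1_size = 16;
    uint8_t array2[256 * 4096];

    void victim(size_t x) {
        if (x < array1_size) {            /* attacker mistrains this branch */
            uint8_t secret = array1[x];   /* load 1: out-of-bounds secret   */
            volatile uint8_t tmp =
                array2[secret * 4096];    /* load 2: leaves a secret-       */
            (void)tmp;                    /* dependent cache footprint      */
        }
    }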
Veedrac
This argument holds better for an OoO CPU that is speculating 100 instructions ahead, so there's significant work done in this window. When your speculative execution is only 2 cycles ahead, you aren't throwing away much work; you'd be lucky to even have work to throw away by that point, at least as it applies to cache misses.
Oct 18, 2018 · 6 points, 0 comments · submitted by espeed
Oct 13, 2018 · 3 points, 0 comments · submitted by leoc
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.