HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
The Mill CPU Architecture – Threading (13 of 13 & more to come)

Mill Computing, Inc. · YouTube · 116 HN points · 1 HN comment
HN Theater has aggregated all Hacker News stories and comments that mention Mill Computing, Inc.'s video "The Mill CPU Architecture – Threading (13 of 13 & more to come)".
YouTube Summary
Docs: https://MillComputing.com/docs/threading/
Comments: https://MillComputing.com/topic/threading/
0:00:00  01 start
0:00:18  02 talks in this series
0:03:27  03 contents of this talk
0:04:01  04 The Mill ISA
0:05:53  05 What about the OS?
0:07:49  07 Mechanism vs. policy
0:09:04  08 Some philosophy
0:09:58  09 The normal thread stack
0:10:36  10 The Mill secure thread stack
0:11:58  11 Interrupts, traps, and faults
0:13:47  12 Processes – software part
0:14:16  13 Processes – hardware part
0:14:58  14 Turfs
0:19:26  19 Portals
0:20:55  20 Portal calls
0:22:39  21 Threads and Turfs
0:23:01  22 Dispatch
0:24:07  23 Dispatching
0:25:20  24 Mill Chess
0:25:57  25 Concurrency vs Parallelism
0:26:25  26 Cooperative multi-threading
0:27:58  27 Dispatching cooperatively
0:28:53  28 Preemptive multi-threading
0:29:36  29 Blocking IO
0:29:52  30 Preemption
0:33:33  31 Spillets
0:35:26  35 Stacklets
0:37:00  36 Stacklets – and portal calls
0:39:22  39 Stacklet allocation
0:41:30  40 Stacklet allocation – implicit-zero memory
0:43:17  41 Callbacks
0:43:54  42 The stacklet info block
0:46:13  43 Thread creation
0:47:25  44 Stack fragments
0:48:09  45 Exceptional unwind
0:50:24  46 HEYU op
0:52:37  47 Thread death
0:53:45  48 Thread death – reclamation
0:54:03  49 Last slide, with URLs
Questions
0:55:16 Linux? Windows?
0:56:04 Can a Mill OS host other OSes?
0:57:49 Is there a supervisor mode?
0:58:24 Are there protection rings?
0:59:06 Can the dispatch op be used outside the kernel?
1:01:00 Fixed number of hardware thread IDs?
1:02:30 Turf creation and reclamation
1:03:21 Can any thread create a turf?
1:04:24 What about killing a thread?
1:04:46 What are the regions in a Turf?
1:06:12 What are the kinds of region permissions?
1:06:44 What about simultaneous multithreading?
1:10:30 Is there a working simulation?
1:11:32 How does the Mill get 10x performance?
1:17:02 Compiler status? ILP? performance?
1:21:11 How many cycles for a dispatch op?
1:21:30 Do multiple cores share the virtual address space?
1:24:30 What about stacklet overflow for large objects?
1:25:42 Number of cycles for the various ops
1:29:41 Single Address Space VM, TLB, and multicore
1:30:48 What about quotas?
1:32:04 What about tracking usage stats?
1:34:47 What about L4 microkernel?
1:36:33 Yes, the kernel has access to all
1:37:50  end
Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
It’s still possible to have memory protection without having to do address translation. In a more extreme example, imagine every program on a system sharing the same 128-bit address space. There’s plenty of room for everyone, TEXT, stacks, heaps, etc, and if you try to step on another program’s memory you can still get a segfault. The Mill architecture sort of works this way, except it goes even further by leveraging this in a clever way to perform fast IPC.

https://m.youtube.com/watch?v=7KQnrOEoWEY
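A toy model may make the point concrete. This Python sketch (with an invented grant table and made-up addresses; it is not the Mill's actual protection mechanism) shows protection without translation: addresses mean the same thing in every process, and access rights, not address remapping, provide the isolation.

```python
# Toy single-address-space protection: addresses are never remapped, only
# checked against a table of grants. Invented layout; not the Mill's design.

class SegFault(Exception):
    pass

# Each grant: (base, limit, owner_pid, perms). All processes share one space.
GRANTS = [
    (0x0000_0000_0000_0000, 0x0000_0000_1000_0000, 1, "rx"),  # pid 1 text
    (0x0000_1000_0000_0000, 0x0000_1000_0100_0000, 1, "rw"),  # pid 1 heap
    (0x0000_2000_0000_0000, 0x0000_2000_1000_0000, 2, "rx"),  # pid 2 text
]

def access(pid: int, addr: int, mode: str) -> None:
    """Permit the access if some grant covers (pid, addr, mode); else fault."""
    for base, limit, owner, perms in GRANTS:
        if owner == pid and base <= addr < limit and mode in perms:
            return
    raise SegFault(f"pid {pid}: {mode}-access to {addr:#x} denied")

access(1, 0x0000_1000_0000_0040, "w")      # fine: pid 1's own heap
try:
    access(2, 0x0000_1000_0000_0040, "w")  # pid 2 stepping on pid 1's memory
except SegFault as e:
    print(e)
```

The fast-IPC angle mentioned above falls out of this: handing another protection domain a pointer needs no translation, only an extra grant.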

bfuclusion
Right, I can imagine if you wrap memory allocation correctly you can do that. But CPUs have finite memory, so a program can guess where it sits in the real physical memory, and from that derive a potential list of programs that already exist. You can probably even run profiling and figure out _what_ other software is running, by their allocation totals.
jhardy54
I'd be interested in how feasible this would be. Even a tiny address space of 256 bytes has tons of potential ways to fill it (the factors of 256): a process could have 1, 2, 4, 8, 16, 32, 64, 128, or even 256 bytes. How do you identify which are running?
MertsA
The security talk is probably a better example for your point.

https://www.youtube.com/watch?v=5osiYZV8n3U

kabdib
In the 1980s I helped design a "one gate delay MMU" for the Atari ST. It gave you relocation and bounds checking in two directions ("text" going up from zero, stack going down from infinity) in power-of-two-size increments, and we fit it into the existing DRAM controller path and timing.

Never had a chance to do a Unix port to the ST, but it would have been fun to use that hardware.
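For the curious, here is a rough software model of the two-direction check kabdib describes: text legal from zero up, stack legal from the top down, both limits rounded to power-of-two granularity so the hardware only has to compare a few high-order bits. The granularity and layout here are guesses, not the actual Atari ST design.

```python
# Rough model of a two-direction bounds check with power-of-two granularity.
# Granularity and memory layout are guesses, not the actual Atari ST design.

REGION_BITS = 16                 # 64 KiB granules, say
GRAN = 1 << REGION_BITS

def round_up(n: int) -> int:
    return (n + GRAN - 1) // GRAN * GRAN

def legal(addr: int, text_limit: int, stack_limit: int, top: int) -> bool:
    """Text is legal in [0, text_limit); stack in [top - stack_limit, top).
    Power-of-two rounding is what makes the hardware cheap: the comparison
    only has to look at the address bits above REGION_BITS."""
    return addr < round_up(text_limit) or addr >= top - round_up(stack_limit)

TOP = 1 << 24                                     # 16 MiB of DRAM, say
print(legal(0x4000, 0x8000, 0x1_0000, TOP))       # True: inside text
print(legal(0x80_0000, 0x8000, 0x1_0000, TOP))    # False: the gap faults
```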

Dec 11, 2017 · 116 points, 50 comments · submitted by Avi-D-coder
saosebastiao
I've been thinking for a while about both Itanium and the Mill, both of which I feel are pretty big advancements in the state of the art for computer architectures. Itanium was a complete flop, and the Mill's fate is yet to be seen.

Explaining success or failure is always fraught with peril and prone to oversimplification, but I can't help but think that Intel massively fucked up one thing and one thing only: they targeted the enterprise market. Not just the commodity data center market, but the enterprise enterprise market, the we-only-buy-IBM, hyperconservative, slow-to-move market. The type that still runs COBOL on mainframes (and still calls them mainframes) because it works and recompiling isn't even close to an option. Nobody in this market wants a nominal increase in computing power if it comes at the expense of backwards incompatibility.

They couldn't have seen it beforehand because smartphones weren't a thing, but a few years after launch they had their ideal target market right in front of them. Smartphone manufacturers will do anything for an incremental improvement in power efficiency. They'll take that improvement and exploit every ounce of it, both upmarket in high-end phones and downmarket in Android burners for third-world countries. And people go through phones like crazy...nobody is running software on their phones that is more than 1-2 years old, and most OSes and apps have been updated at least once in the last 3 months. A recompile and migration to a new architecture for this market isn't even 1% of the hurdle that enterprise software was.

I hope that if the Mill makes it to market, that they get the market right. I'd hate to see another innovation get the shaft because of something as dumb as some marketing decisions.

rogerbinns
Your view of what happened with Itanium isn't what really happened. There is a wonderful talk by Bob Colwell (architect of the P6 / Pentium Pro), given for the Stanford EE380 course in 2004, titled "Things CPU architects need to think about". Sadly it looks like all online copies are gone. I very highly recommend the whole talk if you can find it.

Itanium was supposed to take over everything, not just "enterprise". Amusingly, its performance projections were based on 36 hand-coded instructions from a representative inner loop in SPEC, and management went ahead based on that. Even though it would leapfrog x86 in theory, in practice x86 did a steady march of performance improvements (helped by Intel's fabs). As Itanium got late, rather than cancel the project, they decided x86 was for the masses and Itanium for the enterprise.

I really like that they are trying the Mill, but I suspect RISC-V is going to soak up the dollars and attention.

chubot
I got excited about Mill a couple months ago after watching some videos (I finally understood a little bit, after seeing their materials pop up for years). It's refreshing to see a design that crosses hardware/software boundaries rather than just hacking on one side of the fence.

But then I noticed that Mill is not an ISA but a family of ISAs? (Small, Medium, Large or something like that) They are hacking LLVM so that it knows about all of the ISAs.

But doesn't this cause a problem for say JIT compilers? (JVM, every major JavaScript engine) Every single JIT compiler has to know about 4 ISAs? I get that they are similar, but that seems onerous. Debugging tools have to know about them too. Even strace is coupled to the ISA. I think the costs may have been underestimated.

Anything that changes the hardware/software boundary is already risky because you have to change two things at once. But if you're going to do that, I would think it should be a single stable interface?

In other words, I think the coupling between their own compiler tech and the hardware is too close. Not everything is a portable C program. There are still people running Fortran, not to mention non-LLVM-based compilers like Go's.

Symmetry
I think what is supposed to happen is that the JITs generate general assembly for the whole Mill family, which then goes through the standard specializer program to produce code for the particular model.

But you're right that changing the hardware/software boundary is really complicated and I expect that to succeed the Mill will have to spend a lot of time incubating in some sort of high end embedded role before the ecosystem needed for a general purpose application processor is there. So things like network switches, cell towers, robots, etc. The sort of thing where you might already be running Linux on top of some sort of RTOS.

infogulch
> Every single JIT compiler has to know about 4 ISAs?

Actually it's way worse, but much better.

Better first: typically, binary programs targeting the Mill don't target a specific machine; they target a hypothetical "general Mill" machine code called GenAsm [0]. GenAsm makes some generous assumptions about the hardware, like an infinite belt and the presence of every machine instruction. Included with each machine is the Specializer [1], a program that takes GenAsm and converts it down to the specialized binary encoding for that specific machine; think of it like a linker. This includes translating infinite-belt semantics onto a finite belt, polyfilling any missing machine instructions with microcode, etc. This process is very fast and the OS can cache the result. JITs can use an API to convert generated GenAsm into runnable machine code, which includes running it through the Specializer. The Specializer is built, along with all the other tooling, from the specification that Ivan mentioned briefly.

Now for worse: because Mill machines are specification-driven, there could be many more than just 4 ISAs. There could be more ISAs than they have customers, depending on the needs in each case. But it's no big deal, because everything targets GenAsm and the machine-code differences will be specialized away.

I'm pretty sure it's the Specification talk [2] that goes into the most detail about this.

[0]: http://millcomputing.com/wiki/GenAsm_(code_representation)

[1]: http://millcomputing.com/wiki/Specializer

[2]: https://millcomputing.com/docs/specification/
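To make the GenAsm-to-Specializer step above concrete, here is a toy version in Python. The GenAsm shape, the op names, and the fill/emulation strategies are all invented for illustration; the real Specializer is far more sophisticated.

```python
# Toy "specializer": generic code assumes an unbounded belt and every op;
# the per-model pass polyfills missing ops and patches belt references that
# fall off this model's shorter belt. Ops and encoding are invented here.

TARGET = {"belt_len": 8, "ops": {"add", "mul", "fill"}}

GENASM = [
    ("add", 0, 1),     # add the two newest belt values
    ("fma", 0, 1, 2),  # fused multiply-add; suppose this model lacks it
    ("add", 0, 11),    # operand 11 is beyond this model's 8-entry belt
]

def specialize(genasm, target):
    out = []
    for op, *operands in genasm:
        if op not in target["ops"]:
            # Polyfill: stand-in for substituting an op sequence the model
            # does have (the real thing emits equivalent "microcode").
            out.append(("emulate", op, *operands))
            continue
        for pos in operands:
            if pos >= target["belt_len"]:
                # The value fell off this model's belt: it was spilled
                # earlier and must be filled back before use.
                out.append(("fill", pos))
        out.append((op, *operands))
    return out

print(specialize(GENASM, TARGET))
# [('add', 0, 1), ('emulate', 'fma', 0, 1, 2), ('fill', 11), ('add', 0, 11)]
```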

benchaney
This doesn't seem like a huge deal to me. When you get down to it, there are a ton of different optional x86-64 extensions. Most packages target the simplest subset (no SSE for 32-bit, only SSE2 for 64-bit). At the end of the day this is basically the same thing.
reitzensteinm
I don't think that's a good comparison. Feature flags are added every couple of years on x86, but it's fairly linear.

It's more analogous to something like CUDA, where you are completely abstracted away from the actual assembly that's running on the GPU. Or like targeting LLVM bitcode.

benchaney
How is it abstracted away from the actual assembly? Sure, you don't know how large the belt is, but you can just use the smallest size. The entire point of the belt is that the overwhelming majority of the time you only need the front few elements. The other major difference is the presence of extra special purpose instructions. That is exactly what feature flags are primarily used for.
reitzensteinm
The number and type of execution units will also affect the translation. Each Mill CPU has a custom encoding format generated to match its resources.

x86 has an essentially bidirectional 1:1 mapping between the assembly and the bytes directly executed by the CPU. The Mill does not have that.

chubot
> There could be more ISAs than they have customers, depending on the needs in each case. But it's no big deal because everything targets GenAsm and the machine-code differences will be specialized away.

Yeah so this is the point I'm quibbling with. I have no doubt it's technically possible. I'm saying that it will hinder adoption, and they're probably underestimating the diversity of software components that generate native code, and underestimating the cost of modifying all those components.

Something like Xen succeeded because it was designed up front to be a trivial modification to kernels -- i.e. paravirtualization.

This sounds like a whole new architectural element. They're not only changing the interface between the CPU and the compiler; they're also changing the relationship between the kernel and the CPU (aside from there being a different ISA).

It's good to test assumptions, and I wish them luck. But after being somewhat excited about it, I feel it's just too ambitious. I'd love to be proven wrong though.

infogulch
To be clear, I'm just a random that's been following the mill project for a while, so please don't take my answers as gospel.

The mill is a new, novel, ISA, which will require compilers to support it as a target. There's no getting around that. But once the tooling is ready, typical programs written in high level languages like C (i.e. excluding inline assembly and architecture-specific assumptions) will be a compiler flag away from being able to distribute binaries that run on all mill chips.

If it's the specializer you're concerned about, it's intended to be very transparent to the typical user, integrated into the system. Most users are completely unaware that a thing called the linker even exists; this should be similar. By the way, just building the spec (which defines belt size, available instructions, etc.) generates a fully functioning specializer for that machine. I'm pretty sure it does that today.

__s
Curious how applicable the Mill is for real-time use cases.

The other end is that, with the instruction set not being binary stable, I'm curious how well the Mill would work for something like Singularity https://en.wikipedia.org/wiki/Singularity_%28operating_syste... or a hypothetical WebAssembly OS where userspace programs are an IR for the OS to compile. IIRC the Mill is supposed to have its own IR for program portability.

Binary translation viability is key if they want to support Windows-- see current Windows 10 for ARM

SAI_Peregrinus
Around 4:30 in the linked video: it's statically scheduled, in-order, and all opcodes have fixed execution latency. Should be good for real time.
ema
There is still variable timing depending on whether the needed data happens to be in cache or not. Not sure how hard it is to code in such a way that it is predictable what data is gonna be in cache at which times.
snuxoll
DRAM access is always going to be variable by nature, though they reduce the problem somewhat by never reordering loads and stores. Stores will always write back to D$1 cache and only hit DRAM when a line needs to be flushed from the last level cache, so assuming your data and code fits in cache you can theoretically have 100% determinism (although all of the specific latency numbers are model-dependent, so just like a traditional DSP you'll have to tune for each target).
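A small arithmetic sketch of why fixed latencies give this determinism (the latency numbers below are invented placeholders, not real Mill figures): on a statically scheduled, in-order machine, the time for a dependent chain of ops is just the sum of their fixed latencies, with no reorder buffer or cache-miss variability to guess about, as long as the data stays in cache.

```python
# Invented per-op latencies; a real Mill's numbers are model-specific.
LATENCY = {"load_hit": 3, "mul": 3, "add": 1, "store": 1}

def chain_cycles(ops):
    """Cycle count of a chain where each op consumes the previous result."""
    return sum(LATENCY[op] for op in ops)

# Same answer on every run -- the determinism described above:
print(chain_cycles(["load_hit", "mul", "add", "store"]))  # always 8
```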
SAI_Peregrinus
It's also worth noting that hard real-time systems tend to be custom designed, and often for high-budget products like medical devices or test equipment. So it is probably possible to add a bunch of SRAM to the board, and guaranteed latency to that memory is very easy.
Quequau
I've only been paying a little attention to this but it does seem pretty interesting and maybe even promising.

Have they given any indication about production timelines recently?

xfer
AFAIK, their goal is to produce novel patents.
wtallis
Their goal was to produce chips. Then the America Invents Act was signed and they were forced to become much more patent-focused in the short term or forever abandon any hope of patent protection for their works.
pohl
Despite not being a HW expert, I've watched each one of these videos carefully. (Whenever a new one comes out it's like Christmas to me.) Ivan has been pretty consistent on this message each time this topic comes up. They're being careful to protect their work before they disclose it, yes, but they have always said that they're in this to produce hardware. If you've encountered something that contradicts this, I'd like a citation.
petermonsson
Don’t keep your hopes high. The first video is 4 years old.
Quequau
Well, I don't think it's possible to design a novel general purpose processor overnight, so I'm not surprised that it's been four years coming. On the other hand listening to this talk I walked away with the impression that they're further away from making real chips (even engineering samples) than the last talk.

Whatever the case, I do hope they make it to market but maybe that's just my own morbid curiosity.

loup-vaillant
I personally totally keep my hopes high. I just don't hold my breath.
ema
IIRC at the time of the first video they had already worked on it for around a decade. Personally I'm hoping that just means they have enough stamina to see it through and not that they'll never finish.
snuxoll
To be fair, these guys aren't Intel, or even AMD or ARM - they're a small team with limited funding. I'm not going to be shocked if Mill themselves never release production silicon, but worst case their novel ideas will only be under patent lock-and-key for a period of time - someone else will have the opportunity to make use of them eventually.

With that said, I'd love to see these things come off a fab line someday - there's a lot of potential in the ideas behind the Mill architecture. Whether they'll pan out remains to be seen, but if they fail I'd rather see another Itanic than have it never make it to market in the first place.

callesgg
They don't want to produce it themselves; they want to sell it as a license. I read (or heard) that somewhere.
CarVac
In the early talks Ivan said they intend to produce it themselves.
henrikeh
I concur. Been watching the videos for the past month and he repeatedly states that they are aiming at producing and selling chips, not IP.
monk_e_boy
Is there a video that explains the Mill CPU in detail but isn't 15 hours long? I could go for 2 or 3 hours.
ithkuil
FWIW I wholeheartedly recommend watching all the material about the mill CPU; I found it deeply refreshing and illuminating. (disclaimer: I'm a software engineer, with basic knowledge about hw)
monk_e_boy
I've watched a few. But I don't need 15 hours of info, 3 is fine :) I'm not THAT interested (yet!)
Scaevolus
This forum post is the best summary I've seen: https://millcomputing.com/topic/introduction-to-the-mill-cpu...

It links to videos with longer explanations.

loup-vaillant
I've seen a number of those, and I'm pretty sure there's no way to condense a detailed explanation down to 3 hours while still being accessible to non-specialists. Even the current videos all begin with a "gross oversimplification" disclaimer, so they're probably not very detailed to begin with.
wtallis
I think it would be possible to describe everything novel about the Mill in three hours (at least the stuff that's been made public so far). It probably wouldn't be possible to even begin covering the implications: how to use the novel features to achieve better performance or security.

They did an entire talk on how to translate a switch block into Mill assembly; that talk didn't really introduce any new hardware features, it just described how to use them.

loup-vaillant
Probably. But you'd have to be a specialist to infer the implications. I gather they wanted to be accessible to a wider audience.
mcshaner1
There are a couple of shorter videos with Hackaday, "Mill CPU for Humans".
Symmetry
The wiki is probably the best resource then?

http://millcomputing.com/wiki/Architecture

EDIT:

But to summarize:

* This is an exposed-pipeline VLIW design, except that the instructions are variable width, so it doesn't really fit the traditional conception of VLIW. There are a bunch of clever tricks for compressing the instruction stream and minimizing fetch bandwidth.

* Instead of registers, recent results go into a static-single-assignment mechanism called the belt, where the last N results are visible.

* To handle memory speculation, values are essentially wrapped in Maybe monads, in a way that cleverly gets around Itanium's problems with speculation.

* To reduce data pressure, new pages can be declared in the cache hierarchy full of zeroes, and are only assigned backing DRAM when ejected.

* The caches all work in terms of virtual addresses; translation to physical addresses only happens at the DRAM interface.

* The stack is managed in hardware by its own dedicated mechanism, the Spiller, which starts pushing data to L2 on its own as it fills up.

* Code for the Mill is first compiled to something called general assembly, which assumes every possible instruction, infinite execution width, etc. A single-pass specialization step performs the necessary substitutions when an executable is loaded, at the dynamic linking stage.
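The "Maybe monad" point above is worth a sketch: a speculative load that would fault yields a NaR ("not a result") marker instead of trapping; the marker flows through arithmetic, and only a real use of the value, such as a store, faults. A simplified guess at the semantics, in Python:

```python
class NaR:
    """Marker for "not a result": a speculative op that failed."""
    def __repr__(self):
        return "NaR"

def spec_load(mem, addr):
    # A speculative load never traps; a bad address just yields NaR.
    return mem[addr] if addr in mem else NaR()

def add(a, b):
    # Arithmetic propagates the marker instead of faulting.
    if isinstance(a, NaR) or isinstance(b, NaR):
        return NaR()
    return a + b

def store(mem, addr, value):
    # A store is a real, non-speculative use: the deferred fault fires here.
    if isinstance(value, NaR):
        raise RuntimeError("fault: NaR reached a non-speculative use")
    mem[addr] = value

mem = {0x10: 7}
x = add(spec_load(mem, 0x10), 1)    # 8
y = add(spec_load(mem, 0xBAD), 1)   # NaR -- no trap at load time
store(mem, 0x20, x)                 # fine
try:
    store(mem, 0x30, y)             # only now does anything fault
except RuntimeError as e:
    print(e)
```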

lasermike026
Do not meddle in the affairs of wizards, for they are subtle and quick to anger.
tromp
I remember a variation ending with "for you are crunchy and good with ketchup"
teddyh
That’s dragons, not wizards.
justin_vanw
Why are we talking about this forever? Can't someone just make one of these, demonstrate that they don't work well in practice, and then we can stop wasting time on it?
db48x
Are you volunteering?
CarVac
They are only talking about it as quickly as they can get patents filed. Once the patents are in place, then they'll get to taping out, I presume.
justin_vanw
Filing a patent takes about 2 months; they've been on this nonsense for 4 years or so.

Edit: 14 years...

kinghajj
IIRC, they had something on the order of 50 patents, which they only started on 3-4 years ago. The first 10 years of the project were entirely spent thinking up the new, patentable ideas that go into the thing. As Ivan says, you can't put a schedule on insights--they happen when they happen.
rcxdude
Not to mention this is still basically a spare time project for all of them.
justin_vanw
Well I guess I'll wait another 14 years before I comment again. I'm sure you'll all still be here pretending this is a real thing.
phaedrus
Lately I've been thinking that something inspired by the Belt idea could be applied to making an innovative homebrew 4-bit TTL CPU. The concept is: feed each of the 4 output bits from a 74181 ALU into a separate chain of 74HCT595 shift registers. (Instead of an implicit accumulator, every machine-code command would be an implicit push-result-onto-belt.) Taking a slice of each parallel data-out pin across the shift-register lanes, you get a "belt" slot. I.e. the 4 shift-register chips' d0's give you ALU out from time t-0, the 4 d1's give you ALU out from t-1, etc. The effect is that instead of an accumulator register, you have 32 bits of ALU output history (or 64 bits or more if you chain more than one 74595 for each bit).

Then, having created an ersatz very wide demux, connect all 4x8 or 4x16 lines to a group of 74151 muxes to get back down to 4-bit register bus values - with the capability that different groups of muxes can independently point to any nibble in the history memory.

Although the same end result could be accomplished without going through a 4x16 "bit matrix" or "crossbar", the setup has some nice properties, particularly for a hobbyist TTL CPU:

* Generating 16 bit address lines and 8 or 16 bit memory-data IO lines could be done by grabbing 2 or 4 nibbles at a time.

* You could drive LEDs or a (decoded) hex 7-segment display directly from the bit-matrix lines to see all N history values.

* If you have at least two mux units, the "A" and "B" inputs of the ALU chip could be pointed at different history values.

* The two mux units could be ganged to feed the wider RAM-address and RAM-data registers.

* When doing math or logic on 8, 16, etc. bit values, you wouldn't have to change the register selectors (mux addresses) to change nibbles: the act of pushing the result moves the next operands into position under the "tape head".

That last point means that a fairly simple TTL circuit could flexibly support 4,8,12,16,...,64 bit ALU ops (provided enough 74HCT595s were connected to provide 2x that many bits total). Just set up the initial data and operation, load a TTL counter with the desired number of nibbles, and let 'er rip at the max speed the 74181 can handle.

I call this "The Suspenders" CPU architecture.
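A quick software model of the shift-register belt described above may help; this is a sketch of the idea, not the TTL circuit, and the belt length and helper names are invented. Each 4-bit ALU result is pushed into a history of slots, and two mux selectors pick any two past nibbles as the next operands.

```python
# Software model of the shift-register belt: each new 4-bit ALU result is
# pushed in front, everything older shifts down one slot, and two selectors
# ("muxes") can point at any past nibble. A sketch, not the circuit.

BELT_LEN = 8                    # e.g. one 74HCT595 per ALU bit = 8 slots

class SuspendersCPU:
    def __init__(self):
        self.belt = [0] * BELT_LEN          # belt[0] = newest result

    def op(self, fn, sel_a, sel_b):
        """Apply a 74181-style ALU function to two belt slots and push
        the 4-bit result onto the front of the history."""
        result = fn(self.belt[sel_a], self.belt[sel_b]) & 0xF
        self.belt = [result] + self.belt[:-1]
        return result

cpu = SuspendersCPU()
cpu.op(lambda a, b: b + 1, 0, 0)   # push 1
cpu.op(lambda a, b: a + b, 0, 0)   # 1 + 1 = 2
cpu.op(lambda a, b: a + b, 0, 1)   # 2 + 1 = 3; belt is now [3, 2, 1, 0, ...]
print(cpu.belt)
```

Note how the multi-nibble trick in the last bullet shows up here for free: pushing a result shifts the next operands into the fixed selector positions, so wide adds never need the selectors changed.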

loup-vaillant
Sounds nice. I just have a little problem: hasn't the 74181 been discontinued? I never found a modern equivalent, unlike other components such as multiplexers and shift registers.

I'd like to make a TTL CPU, but the more I look at it, the more it seems I'll have to build my ALU without the 74181. I don't like relying on discontinued parts that may be hard or impossible to source in the near future.

ChuckMcM
This is the kind of experiment it is fun to use small FPGA evaluation boards to run. (or a larger CPLD).
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.