HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
"Uptime 15,364 days - The Computers of Voyager" by Aaron Cummings

Strange Loop Conference · Youtube · 231 HN points · 6 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention Strange Loop Conference's video "Uptime 15,364 days - The Computers of Voyager" by Aaron Cummings.
Youtube Summary
The Voyager 1 and Voyager 2 space probes, both launched in 1977, each had a primary objective to explore Jupiter and Saturn. This goal was achieved by 1981. Yet Voyager, NASA's longest running mission, has continued to this day. Both Voyager probes are still operating, and returning scientific data from outside our solar system.

This talk explores the computing systems of Voyager - the systems which enable remote control of the spacecraft, and provide for the recording and return of data to Earth. These systems have proved adaptable, durable, and resilient in support of a scientific undertaking now in its fifth decade.

What can we learn from the engineering of Voyager's computing systems? Why have they survived for so long in the harsh environment of space? What is involved in patching a system from a billion miles away? And what does the future hold?

Aaron Cummings

Aaron Cummings is a software developer working in the semiconductor industry, currently leading a team working on tools for building and testing embedded memories. He has had a long-term fascination with the space program, and has been interested in Voyager since seeing the pictures returned from Jupiter and Saturn in the 1980s.

Chapter listing

00:00 Background on Voyager
01:16 Talk outline
02:00 About the speaker
02:25 Origins of Voyager
03:18 Initial requirements
08:25 Revised program
10:05 The 3 R's
11:08 Probe design
12:55 Command Computer System
17:40 Attitude and Articulation Control System
20:04 Flight Data System
23:56 Voyager 0
24:35 Voyager 2 mission
29:07 Sustaining power
30:26 The golden record
31:20 Q&A
HN Theater Rankings

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Aug 24, 2022 · 8 points, 2 comments · submitted by behnamoh
That was 3 years ago :)

Previous discussion

Worth watching

Episodes 5 and 6 of Blaine Baggett's "JPL and the Space Age" documentary are going live this week as part of Voyager's 45th anniversary celebration

Episode 5: The Stuff of Dreams

Episode 6: The Footsteps of Voyager
StrangeLoop talk: "Uptime 15,364 days - The Computers of Voyager" by Aaron Cummings
Dec 27, 2021 · 7 points, 0 comments · submitted by simonebrunozzi
Jun 10, 2021 · 5 points, 1 comment · submitted by fagnerbrack
Took time to watch it. Truly mind-blowing achievement.
Jun 07, 2021 · 4 points, 0 comments · submitted by cyunker
In Sweden between 1930 and 1970 we produced concrete building elements where oil shale was used to burn the chalk, and the ashes of the fuel were mixed into the concrete!

That shale ash was so rich in uranium that the concrete turned blue! Since then, thousands have died from passively breathing the radon emanating from the walls of some Swedish houses.

On the subject of RTGs, which are also used in space probes: at 5:35 he talks about the plutonium generator.
Dec 07, 2019 · 6 points, 0 comments · submitted by bathtub365
a recent talk on the spacecraft and the project —
"Uptime 15,364 days - The Computers of Voyager" is a pretty good introduction.
Oct 18, 2019 · 201 points, 56 comments · submitted by big_chungus
I wonder what the 'safest' uptime possible is today, for a computer connected to the internet? e.g. what's the oldest linux kernel that has no known remote attacks (not just remote exploits but DOS weaknesses too) ?

To make it more difficult, what would be the safest uptime for a box that allowed remote logins? SSH flaws don't count, since you can always upgrade that on the fly, but kernel-level privilege escalation weaknesses would count as critical.

It depends on your definition of a computer. A bare-metal embedded system could be fine without any software updates, ever. Some of them are in-place sensors for weather/temperature/whatever. Think of an Arduino with an Ethernet shield that measures air temperature, encrypts with AES, and sends it over a plain TCP connection. No port is open for listening, so there is nothing to hack into. The only possible vulnerability is OTA update, but in some cases you might purposefully avoid that. For example, in the rail industry, the only way to update their devices is to physically change the hardware, and that is done intentionally.

The hardware might be sleeping for 99% of its service life, and is typically over-engineered. They could run into the next century if corrosion doesn't get them.
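A minimal, hypothetical sketch of such an outbound-only sensor node (the frame layout, names, and endpoint are illustrative; a real device would also encrypt the payload with AES and run on bare metal rather than CPython):

```python
import socket
import struct


def frame_reading(sensor_id: int, temp_centi_c: int, ts: int) -> bytes:
    """Pack one reading into a fixed 10-byte frame: id, temperature
    in hundredths of a degree C, and a Unix timestamp."""
    return struct.pack(">HiI", sensor_id, temp_centi_c, ts)


def push_reading(host: str, port: int, frame: bytes) -> None:
    """Outbound-only: the device connects out to send; it never
    listens on any port, so there is nothing to connect to."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(frame)
```

The key design point is that the device only ever initiates connections, which is what shrinks the remote attack surface the comment describes.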

Client side vulns do exist. Your model is not complete.
I am not claiming they don't, but often these devices only run thousands of lines of code, not millions. The security challenge is much, much smaller than securing the Linux kernel.
If it's doing TCP then there's a whole network stack to attack. I bet there's a big range of embedded systems that can be crashed because they have a simplistic cut-down, non-hardened TCP/IP/Ethernet implementation that can be abused. You've got a chance of breaking it through sending malformed packets to cause a panic, or just exhausting its memory - which might cause it to reboot (maybe lots of fragmented packets?)

The lack of listening ports might shrink the attack surface, but a malicious endpoint that it connects to might be able to confuse it. (or perhaps some evil MITM attacker)

Malicious endpoint is certainly an interesting scenario, let me work through a few possibilities:

1. Wiznet produces chips where the entire network stack is implemented in hardware, there is no code. Malformed packets will never touch my code. However, this is a minority of devices.

2. These devices typically have no dynamic memory allocation at all, they will work with a circular buffer of fixed size. Normally they can't run out of memory.

3. You might be able to clog up their CPU; however, they typically use an RTOS, and those operating systems place hard limits on the amount of CPU time different components of the system can take. At best, you will prevent the device from sending anything out over the network while you are attacking it.

4. Of course, if there are mistakes in the TCP/IP stack, you might cause issues. Most of them are using the lwIP stack; I am not qualified to comment on its security.
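Point 2 above (a fixed-size circular buffer instead of dynamic allocation) can be sketched as follows; this is an illustrative Python model of the pattern, not firmware code:

```python
class RingBuffer:
    """Fixed-capacity buffer: all memory is allocated once, up front.
    When full, the oldest sample is overwritten, so memory use stays
    bounded no matter how fast data arrives."""

    def __init__(self, capacity: int):
        self._buf = [None] * capacity  # single up-front allocation
        self._capacity = capacity
        self._head = 0                 # next write position
        self._count = 0                # number of valid samples

    def push(self, item):
        self._buf[self._head] = item
        self._head = (self._head + 1) % self._capacity
        self._count = min(self._count + 1, self._capacity)

    def items(self):
        """Return stored samples, oldest to newest."""
        start = (self._head - self._count) % self._capacity
        return [self._buf[(start + i) % self._capacity]
                for i in range(self._count)]
```

Because a push can never allocate, an attacker flooding the device can at worst cause old samples to be dropped, not an out-of-memory crash.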

Some networks are much harder to hack than others. A friend was working on power meters that use the HV electric grid to call home. They did not ignore network security, but as this was just data collection and they did validation for unusual data they assumed it was a negligible issue.

Granted, while the data went to the internet, it's arguable how internet-connected these devices were.

I bet that these devices are local only. Working in IT, the security measures for these devices are likely that they are sitting on their own isolated VLAN. Routing is likely controlled by layer 3 switches with dedicated ACLs to only allow certain systems to obtain access to that VLAN. If it's a rail company, the network itself is likely protected by a Cisco ASA or an ASA-style dedicated network security appliance. Networks have been hardened pretty well. Most often the exploits that IT worries about are application-based and user-based (social engineering).
While this is best case, I've seen PLCs and SCADA controllers for systems that will most definitely kill you (industrial control, power utilities) connected to the public internet. Airgaps? VLANs? Properly implemented network and auth controls? Not everyone cares as much as they should.
Network hardening fundamentally has some serious limitations because it doesn't expose the principals to the devices controlling access.
It's probably the best way to secure them - but I think that defeats the spirit of the 'connected to the internet' part of this uptime security challenge :)
With kernel upgrades in place is this an issue anymore?
In-place kernel upgrades are cheating wrt uptime :)
No real reason, just that they potentially give you unlimited uptime. I know that's a good thing, just not for this particular silly challenge :)

Although, in-place upgrades haven't been around all that long in the grand scheme of things. Perhaps there's a box out there that predates in-place upgrades and is still running securely?

Depends how long you're willing to backport security fixes. Even Red Hat only goes ~15 years.
Ask your bank what the uptime on their mainframe is; chances are it'll be higher than anything else you've seen (barring a power outage).
I'm a Natwest/RBS customer. I'm going to guess their uptime is about three weeks.
My experience is literally that 1/3 of my logins result in "our system is down right now". Probably because I often want to check my account in the evening on a weekend.
As someone who works in finance, I can tell you it's going to be less than a year. Although uncommon, those things sometimes reboot as part of maintenance. For example, we upgraded our COBOL compiler and all got an email that for like half an hour after midnight on a weekend the mainframe would be down for maintenance.
We rebooted the system as part of training every two years; new staff needed to know how it's done :-)
Why would you limit yourself to Linux? It has a very large attack surface, so new vulnerabilities are found all the time. It has the benefit of a large number of eyes on it, so it gets patched quickly (and can be live-patched for the most part), but it's not a good candidate for super-long uptime.

I would imagine something small and stripped down, serving a particular purpose, would fare better. And OpenBSD prides itself on security for general-purpose computing, but it still gets regular security fixes.

Linux was just an example, in fact I'd guess that there's probably a lot of BSD variants out there with ridiculous uptimes.

I think the 'allows user access' bit is the thing that would limit safe uptimes the most, since once you're logged in to a box, the kernel surface area for attacks is much larger.

It's not immediately obvious to me that the OpenBSD networking stack has a smaller remote surface area out-of-the-box than Linux. That aspect of OpenBSD isn't really stripped down, at least not relative to the default state of other operating systems.
Don't look into BSD; you gotta look at mainframes. Some of the mainframes at banks have been running since they bought the very first one in the 90s or even 80s. Since stuff like VMS allowed simple clustering, you could add modern machines, transfer over, and shut down the old hardware without having to shut down the system itself. These are probably the only machines with a chance of reaching 30+ years of uptime.
But the topic is safe uptime for a computer connected to the internet. I have no doubt a mainframe has a chance to reach 30+ years of lifetime but I also have no doubt it'd also be done for if attacked.
Don't forget some of those systems are highly firewalled behind middle-men which then send back data to the web or other UIs.
Many if not all of these mainframes have been fully upgraded or at least serviced due to hardware failure while still running.

I’m not entirely sure how you define uptime for these machines if none of the original parts are still there.

Trivially: 100%
Philosophers ask a similar question to yours:
I was hinting at that, which is why we should define the uptime of a system rather than of a machine, because with distributed systems the uptime of the system isn't dependent on the uptime of a single "machine", and a mainframe is a distributed system even if it's in a single rack.

The question is then where you define the boundaries of a system and its uptime. At least from my recollection, for mainframes they defined uptime based on the execution of batch jobs and the availability of services, not the OS/hardware, which if it crashed often involved Big Blue coming to investigate WTF happened and how it happened, since System Z machines are designed with so much redundancy that you can swap RAM modules without interrupting the workflow.

Today with RAIM (RAID for Memory) IBM System Z machines even support an entire memory channel dying without interruption.

That says something about the hardware more than about the software, IMO.

In a smallish embedded system there are not too many (software) difficulties in keeping the software running virtually forever.

Getting the hardware to keep running without any fault for 42 years, on the other hand...

One of the main things I remember about these old CCSs is the low level hardware redundancy:

> The Viking CCS had two of everything: power supplies, processors, buffers, inputs, and outputs. Each element of the CCS was cross strapped which allowed for “single fault tolerance” redundancy so that if one part of one CCS failed, it could make use of the remaining operational one in the other. [1]

Modern systems like that of the Curiosity rover also use hardware redundancy (triple redundancy, even), but I believe this happens at a much higher level, i.e. the whole computer.
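A toy model of the cross-strapping described in the quote: with two copies of every element and cross-strapping between the strings, the system survives any single failure as long as at least one copy of each element type still works (the element names here are illustrative):

```python
def system_ok(health: dict) -> bool:
    """health maps element type -> [unit_A_ok, unit_B_ok].
    Cross-strapping means any working unit of a type can be used,
    regardless of which string its partners come from."""
    return all(any(units) for units in health.values())
```

Without cross-strapping, a failed power supply on string A and a failed processor on string B would take both strings down; with it, the mixed set of survivors still forms one working system.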


Is there a leaderboard for uptime somewhere? I found a subreddit on the topic.
In that subreddit there are usually a lot of reports from network devices. Achieving a long uptime on a server where people do things all day is difficult.
Being proud of not having patched a network device for 15 years is a great way to get in contact with security and HR at my company. I can't imagine thinking it's a great thing to go and brag about on the internet lol, different worlds.
Great video! My only nitpick was when Aaron said the camera resolution (800x800) was 640 MEGAPIXELS. I can understand why he misspoke. Everybody today uses megapixels as the measure of pixel density, but back in the mid-1970s, digital cameras did not yet exist and the resolution of the onboard image orthicon tube was actually just 640 KILOPIXELS.
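The arithmetic behind the nitpick:

```python
# An 800 x 800 frame gives 640,000 pixels per image.
pixels = 800 * 800            # 640,000
kilopixels = pixels // 1000   # 640 kilopixels, i.e. 0.64 megapixels
```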
Does anyone have any good resources on how to design systems like these? I find the idea of computers that have to work for decades, be autonomous, and be self-repairing really exciting.
That's around 12 years more uptime than I have - impressive!
Don't you reboot every night? :)
Sleep mode only, a reboot wipes all volatile memory :P
Pffft and lose all my tabs!

Seriously, at this point I have multiple browsers open, multiple tabs, multiple programs split over multiple desktops with my workflows.

But more seriously an encrypted drive that has a lot of data on it that I've forgotten the password for, and can't remember the system I used to encrypt it / set it up, and figuring out how to change it is always pushed to future LandRs problem. A restart and I'm screwed!

I'd basically just give up and go live in the woods.

I know that feeling. I've got an old mac mini in a remote server room, with limited on-site access. It's been up for 2.5 years, going through various ubuntu releases and upgrades. I'm afraid to reboot it because it initially had a strange boot setup, and there's been enough changes now that I'm not sure it'll come back up. So I keep delaying the inevitable, and hope it'll last until I need to get a new machine :)
Reminds me of this
There are companies in that situation too. They live in perpetual fear of power failure or hardware crashes. It's exactly the sort of thing we're on the lookout for during technical due diligence. Anything that you are afraid of rebooting is a risk that needs mitigation while the system is still up and running.
A voice of reason. There is no point to having machines that have to stay up at all costs. You didn't design something correctly.

I use Kubernetes every day and the cluster we're on is not the most reliable and yet, because of Kafka and how pods in K8s restart themselves, we have yet to lose any data or fall behind in our process because we designed it to be resilient to failure. It's sad, but that is our reality.

Clearly Voyager could not accomplish that with its packaging, so it is that much more impressive.

"Premature optimization is the root of all evil." -Donald Knuth
I strongly dislike that his words are always distorted by taking out of context this small segment of what he said.

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail." - Donald Knuth

It brings to mind a quote from Ralph Waldo Emerson that is often abused in a similar manner: "A foolish consistency is the hobgoblin of little minds." It is interesting to observe the results of someone treating "foolish" as a filler word that can be glossed over. For our purposes here, substitute "foolish" with "premature". Leave that word out, and more context is needed. But that's why Knuth didn't leave that word out. With one simple adjective, the statement stands as is.
To be honest, I don't think taking the sentence out of context distorts much. This quote (which I see in full length for the first time) pretty much says how I always understood the shorter version.

It's not "optimization is root of all evil". The key is "premature optimization". Maybe people gloss over that part, but it is right there.

Yes, Knuth goes into more detail on what he considers premature optimization in the context of programming computers. However the short sentence applies much more broadly in my experience.

For example, "premature optimization" of BOM costs in a hardware project can cost you dearly down the road when it turns out that leaving in some extra flexibility in the design would be mighty useful.

For further context it was justifying using a goto statement to shave 12% of the execution time off of a function. Knuth bringing it up was specifically to acknowledge that he is aware of the principle to stave off arguments. I more often see it used to push back against any changes for speed.
Also, of course there are always exceptions to a platitude. I don't think we need to couch every single statement we ever make with "...but there are exceptions, of course!" which is basically what Knuth goes on to belabor.
More like such platitudes are nearly devoid of meaning:

Premature X is bad.

Overusing X is bad.

These are true for most X. If it's not bad, then you didn't do it prematurely or overuse it!

> Uptime 15,364 days

Is the uptime really technically true? Sure, Voyager has been operating for 40+ years, but all embedded systems must have watchdog timers. And given how hostile the space environment is, I'll be surprised if the main system hasn't been reset by a watchdog timer a couple of times to recover from fault conditions; thus the actual uptime must be much less than that.

Voyager 2's computer was reset in 2010:

It suddenly started returning corrupted frames. So engineers had it slowly transmit a core dump, and found a single flipped bit. To fix the problem, they "reset" the computer, which I assume means they rebooted it.
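Locating a single flipped bit in a transmitted dump amounts to XOR-ing it against the expected memory image and finding the one differing bit. A hypothetical sketch of that diff, not JPL's actual tooling:

```python
def find_flipped_bit(expected: bytes, dump: bytes):
    """Return (byte_index, bit_index) of the single differing bit
    between two equal-length images, or None if they match."""
    for i, (a, b) in enumerate(zip(expected, dump)):
        diff = a ^ b              # nonzero only where bits differ
        if diff:
            return i, diff.bit_length() - 1
    return None
```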

From :

> Most spacecraft have more than one Command Loss Timer Reset for subsystem level safety reasons, with the Voyager craft using at least 7 of these timers.
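A command-loss timer is essentially a mission-level watchdog: any valid command from the ground resets it, and if it ever expires the craft falls back to safe behavior. A toy sketch of the pattern (all names and the injectable clock are illustrative, not flight software):

```python
class CommandLossTimer:
    """If no command arrives within timeout_s, invoke the fallback.
    The clock is injected so the behavior is deterministic to test."""

    def __init__(self, timeout_s, on_expire, now):
        self._timeout = timeout_s
        self._on_expire = on_expire
        self._now = now
        self._deadline = now() + timeout_s

    def command_received(self):
        """Any valid ground command pushes the deadline forward."""
        self._deadline = self._now() + self._timeout

    def tick(self):
        """Called periodically; fires the fallback once expired."""
        if self._now() >= self._deadline:
            self._on_expire()
```

Real spacecraft run several of these at different subsystem levels, as the quote notes, so one stuck subsystem can't silently take the whole craft out of reach.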

Isn't this splitting hairs? The real achievement it seems to me is continuous operation within spec. The watchdog timers (whatever they are) are part of the design that enabled this. I don't think it lessens the achievement at all.
Sure, but it's not conventionally what is called "uptime", an uninterrupted period in a normal and responsive state. Voyager has had faults and been rebooted. In any other situation that I'm aware of, you'd have to say that uptime started over or that uptime is <100%.
> I don't think it lessens the achievement at all.

I was not implying that "the fact that the system has been rebooted lessens the achievement"; I said none of that. I was just wondering about the details of the system and the technical accuracy of the statement; isn't that the point of posting on HN?

Running a probe for 15,364 days without even a single bit flip or power-off would be an extraordinary miracle that exceeded all reasonable expectations, not simply the greatest accomplishment.

Please don't assume that every technical statement or question implies undervaluation, criticism, or attack, regardless of how common those are in tech.



"Practically all of Voyager's redundancy is gone now, either because something broke along the way or it was turned off to conserve power. Of the 11 original instruments on Voyager 1, only five remain"

Though I did wonder if they updated the firmware/code in all this time, and no definitive answer stood out:

HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.