HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
2011 GAFS Omega John Wilkes

GoogleTechTalks · YouTube · 9 HN points · 6 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention GoogleTechTalks's video "2011 GAFS Omega John Wilkes".
YouTube Summary
Google Faculty Summit
July 14-15, 2011

Bio:

John Wilkes, Cluster Management, Mountain View

John Wilkes got his PhD from Cambridge University, and joined HP Labs in 1982 where he was elected an HP Fellow and an ACM Fellow in 2002 for work on storage system design. His interests span many aspects of distributed systems, with a recurring theme around self-management of infrastructure systems. In November 2008 he moved to Google, where he is working on cluster management and service level agreements for infrastructure services. In his spare time he continues, stubbornly, trying to learn how to blow glass.
HN Theater Rankings

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
They host pretty much everything -- Maps, Gmail, Search, etc. I highly suggest reading the Wired link below, as it talks about this in detail [1], but if you want a more technical look, check out the YouTube video [2] or the Borg paper [3]. One really interesting development is that Mesos + Marathon + Chronos + Docker puts a very similar system into the hands of the average IT team today [4] (a rough sketch of a Marathon app definition follows the links below).

[1] http://www.wired.com/2013/03/google-borg-twitter-mesos/

[2] https://www.youtube.com/watch?v=0ZFMlO98Jkc

[3] https://research.google.com/pubs/archive/43438.pdf

[4] https://www.youtube.com/watch?v=hZNGST2vIds
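
For readers who have not seen the Mesos/Marathon side of this, here is a rough sketch of submitting an app to a Marathon scheduler over its v2 REST API; the endpoint URL and image name are placeholders, and the exact JSON fields can vary between Marathon versions.

  # Rough sketch: submitting an app to a Marathon scheduler (which in turn
  # asks Mesos for resources and launches Docker containers on the cluster).
  # The endpoint URL and image name below are placeholders, and the exact
  # fields may differ between Marathon versions.
  import json
  import urllib.request

  app = {
      "id": "hello-web",              # logical name of the service
      "cpus": 0.5,                    # fraction of a core per instance
      "mem": 256,                     # MB of RAM per instance
      "instances": 3,                 # Marathon keeps this many copies running
      "container": {
          "type": "DOCKER",
          "docker": {"image": "example/hello-web:latest"},
      },
  }

  req = urllib.request.Request(
      "http://marathon.example.com:8080/v2/apps",   # placeholder endpoint
      data=json.dumps(app).encode(),
      headers={"Content-Type": "application/json"},
  )
  print(urllib.request.urlopen(req).read().decode())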

There was a great talk by John Wilkes (Google Cluster Management) re: Omega in 2011 at Google Faculty Summit [1]. Absolutely fascinating to see the scope of the problems they are dealing with.

[1] https://www.youtube.com/watch?v=0ZFMlO98Jkc

Edit: removed an error in my comment re: Borg/Omega order.

nostrademons
Omega actually comes after Borg (hence the name). It's just that Borg was quite confidential until now. I'm pretty surprised they let the name slip.
WestCoastJustin
Ah, thanks! Updated my comment.
tonfa
The name has been public since Omega was publicized. The paper itself is great; there's everything about Borg in one place (scheduling, quota, isolation, etc.).
tonfa
It is the other way: Omega was presented as the successor of Borg.
sylvinus
If you want to see an up-to-date talk about this from John, he will be a speaker at http://dotscale.io on June 8!
Google has plenty of experience with containers already: they make heavy use of cgroups and the container concept in their production environment for isolation and resource control, and two of their engineers wrote much of the initial cgroups code. I talked about this in my "Introduction to Containers on Linux using LXC" screencast [1]. Briefly, Google is in the process of open-sourcing their internal container code [2], there is a Wired article that covers their container orchestration system [3], and finally there is a talk by John Wilkes (Google Cluster Management) about their container management system [4]. A minimal cgroups sketch follows the links below.

Docker adds very interesting filesystem ideas and software for the management of container images. Personally, I think we are on the cusp of a transition from VPS (xen/hvm) to VPS (containers). I also hope that Google throws some of their concepts at the Docker project. Interesting times for this space.

[1] http://sysadmincasts.com/episodes/24-introduction-to-contain...

[2] https://github.com/google/lmctfy

[3] http://www.wired.com/2013/03/google-borg-twitter-mesos/all/

[4] http://www.youtube.com/watch?v=0ZFMlO98Jkc
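
To make the "isolation and resource control" part concrete, here is a minimal sketch of the cgroup v1 filesystem interface those kernel patches introduced. It assumes a v1 hierarchy mounted under /sys/fs/cgroup with the cpu and memory controllers enabled, and root privileges; cgroup v2 uses different knob names.

  # Minimal cgroup v1 sketch: cap a process's CPU weight and memory.
  # Assumes the v1 cpu and memory controllers are mounted under
  # /sys/fs/cgroup and that this runs as root; cgroup v2 uses different
  # file names (cpu.weight, memory.max) and a unified hierarchy.
  import os

  PID = os.getpid()                      # process to confine (here: ourselves)

  for ctrl, knobs in {
      "cpu":    {"cpu.shares": "256"},                   # ~1/4 weight vs the default 1024
      "memory": {"memory.limit_in_bytes": "268435456"},  # 256 MB hard cap
  }.items():
      cg = f"/sys/fs/cgroup/{ctrl}/demo"
      os.makedirs(cg, exist_ok=True)     # creating the directory creates the cgroup
      for knob, value in knobs.items():
          with open(os.path.join(cg, knob), "w") as f:
              f.write(value)
      with open(os.path.join(cg, "tasks"), "w") as f:
          f.write(str(PID))              # move the process into the cgroup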

jacquesm
> Personally, I think we are on the cusp of a transition from VPS (xen/hvm) to VPS (containers).

I'm not so sure of that. I think a lot of the use-cases for VMs are based on isolation between users and making sure everybody gets a fair slice. Something like docker would work well with a single tenant but for multi-tenant usage docker would give you all the headaches of a shared host and very little of the benefits of a VM. For those use cases you're probably going to see multiple docker instances for a single tenant riding on top of a VM.

The likes of Heroku, AWS, Google etc will likely use docker or something very much like it as a basic unit to talk to their customers, but underneath it they'll (hopefully) be walling off tenants with VMs first. VMs don't have to play friendly with each other, docker containers likely will have to behave nicely if they're not to monopolize the underlying machine.

jamesaguilar
> I think a lot of the use-cases for VMs are based on isolation between users and making sure everybody gets a fair slice.

Containers do this.

jacquesm
Hm. We'll see about that. I can see a whole pile of potential issues here with 'breaking out of the docker', on par with escaping from a sandbox or breaking out of a chroot jail, of which I see this as a luxury version.

Of course you could try to escalate from a VM to the host (see cloudburst) but that's a rarity.

Docker seems to be less well protected against that sort of thing, but I'm nowhere near qualified to make that evaluation so I'll stick to 'seems' for now. It looks like the jump is a smaller one than from a VM.

jamesaguilar
This isn't really a "we'll see" issue. It is a fact that containers do resource isolation. :P The security issues are orthogonal.
jacquesm
I beg to differ. If you manage to break out of a container then all the resources of the machine are at your disposal.

So they're orthogonal only as long as the security assumptions hold.

thrownaway2424
Containers don't isolate very well. One thing that is easy to do is to make the system do disk output on your behalf just by making lots of dirty pages, or make the system use lots of memory on your behalf due to network activity. And of course there are the usual problems that you already have with VMs such as poor cache occupancy.

Shared hosting of random antagonistic processes is something that many developers are not quite ready to embrace. If you are willing to run your service with poor isolation and questionable security then containers are just the thing. You'll definitely spend less money if you can serve in such an environment.

thockingoog
Fair usage of resources and security isolation are two VERY different problems. Containers can be VERY good at resource isolation. Security has not really been figured out yet.
FooBarWidget
I don't know where this myth came from that you NEED VMs for fair slicing. The Linux kernel (and most other OS kernels) has been doing fair slicing just fine for years. I think the disadvantages of containerization are similar to those of OpenVZ VPSes: you can't partition your hard disk and you can't add swap space.
contingencies
You can add swap space, and there is even swap space accounting support in the kernel. Personally I don't use swap; I just buy fat amounts of RAM and allocate them to diskless worker nodes in my clusters. As for partitioning, manual partitioning can give a slight speed advantage (if you know which filesystem you want to use, you have a long enough lived job to justify optimization, etc.), but generally you can just use http://zfsonlinux.org/ or at least LVM2 to avoid the segregation requirement entirely. In the former (ZFS) case you get arbitrary-depth COW snapshots, dynamic reallocation, transparent compression, and other useful options for ~free as well. In the latter (LVM2 LV) case you get single-depth snapshots (though in theory this is improving, e.g. via thin provisioning) but no dynamic resizing support (AFAIK, unless you use nonstandard filesystems).
jacquesm
It's not a myth. A VM is effectively a slice of your computer that you can pre-allocate in such a way that the VM cannot exceed its boundaries (in theory this works perfectly; in practice, not always).

So, all other things being equal, if you slice up your machine into 5 equally apportioned segments and you run a user process in one of those 5 slices that tries to hog the whole machine, it will only manage to create 1/5th of the load that it could create if it were running directly on the host OS.

So yes, Linux does 'fair slicing', if you can live with the fact that a single process will determine what is fair and what is not. The fact that that process gets pre-empted and that other processes get to run as well does not mean the machine is not now 100% loaded.

Using quotas for disk space, 'nice', per-process limits for memory, chroot jails for isolation and so on, you can achieve much the same effect, but a VM is so much easier to allocate. It does have significant overhead and of course it has (what doesn't?) its own set of issues, but resource allocation is actually one of the stronger points of VMs.
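
As a rough illustration of that non-VM toolbox, here is a sketch that applies 'nice', a per-process memory cap, and a chroot jail to a child process before it runs a command; the jail path and command are placeholders, chroot needs root, and disk quotas are left out.

  # Sketch of the per-process knobs mentioned above, applied to a child
  # process before it runs a command.  chroot requires root; the jail
  # directory and command are placeholders.
  import os
  import resource

  def run_confined(cmd, jail="/srv/jail"):
      pid = os.fork()
      if pid == 0:                                   # child
          try:
              os.nice(10)                            # lower CPU priority
              # ~256 MB cap on the address space (a crude memory limit)
              resource.setrlimit(resource.RLIMIT_AS, (256 * 2**20, 256 * 2**20))
              os.chroot(jail)                        # filesystem isolation
              os.chdir("/")
              os.execvp(cmd[0], cmd)                 # never returns on success
          finally:
              os._exit(1)                            # don't fall back into the parent's code
      os.waitpid(pid, 0)

  # run_confined(["/bin/sh", "-c", "echo hello from the jail"])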

justincormack
Well yes, but KVM is a VM that's just using Linux to do this VM scheduling. The main issue is that the API for containers is less well defined (I/O scheduling is not necessarily fully fair with VMs, but it's mainly aio on the host side at least).
pling
I want option 3. A 4U rack with 32 completely isolated embedded stand alone quad core ARM or PPC systems, a network switch and an FPGA on each connected to the switch fabric.

Then we can start doing some interesting stuff past finding new ways to chop computers up.

fanf2
That does not sound very high density compared to what you can get from a company like Baserock - http://www.baserock.com/servers
pling
I want a hefty FPGA attached to the CPU bus and switch backplane. That will take a lot more power than the ARM core.
wooger
32 ARM chips in 4U seems very low to me, just in terms of the TDP a 4U rack is able to dissipate at present. You could increase density a lot.
pling
You could but I want standard storage per node (PCI-E FLASH), redundant PSUs and the TDP of a hefty FPGA going flat out is a lot larger than that of the ARM core.
jacquesm
Very interesting. That would be something I'd buy just to mess around with; I can think of a few ways in which I'd use it right off the bat, and if you give me a couple of hours more I'll have a whole raft of them :)
tzm
Great points. Btw, Victor and Rohit from the Google LMCTFY team are now active maintainers of libcontainer. https://github.com/dotcloud/docker/blob/master/pkg/libcontai...

Also check out Joe Beda's deck from GlueCon: http://slides.eightypercent.net/GlueCon%202014%20-%20Contain...

Docker is a natural fit for GCE.

WestCoastJustin
Great links -- thanks!
billybofh
I'm always surprised that OpenVZ [1] doesn't come up more in discussions about containers. I used to use it extensively for small lightweight VPSes (usually just an Apache or MySQL instance) and always found it to be pretty impressive. I've used Docker to make a Debian-packaged app available on a bunch of CentOS machines and it saved me a huge headache (the dependency tree for the app was huge), so I'm a fan -- but still a little puzzled at OpenVZ's seeming invisibility.

[1] http://openvz.org/Main_Page

FooBarWidget
It's probably because of the way OpenVZ is marketed (or should I say, not marketed). OpenVZ's technology could probably do the same as what Docker does but they're not marketing it in the same way. The concept matters just as much as the actual technology.
billybofh
I guess. Having a commercial version with a different name to push can't have helped the branding either.
wmf
OpenVZ was basically the prototype for LXC. Distros seem to have better support for LXC since it's "official".
contingencies
Don't forget Linux vServers as well.
billybofh
Yeah, I realise its place in LXC history. It still seems slightly odd that it's been kind of overlooked. It offered quite a lot (and still does) that I don't see replicated in any of the other container packages. At least not without quite a lot of manual faff.

Possibly it was just a little ahead of its time and was also overshadowed by the rise of HW virtualisation in the late 2000s. Having to install a custom kernel (certainly when I used it) was also a bit of a hassle, mind you. Anyway -- maybe someone will re-invent the toolchain using Swift or Node and it'll become cool again ;-)

justincormack
It also has a security focus, while docker has started with convenience for deployment rather than looking like an isolated machine.
mikecb
There's also the recent article about automatic, machine-learning-based power and core management [1], [2].

If anyone here specializes in similar things, I would be curious to know if this Pegasus system runs on top of or underneath Borg/Omega (or perhaps replaced it?), or is a separate system altogether.

[1] http://www.theregister.co.uk/2014/06/05/google_pegasus_syste...

Edit: [2] http://gigaom.com/2014/05/28/google-is-harnessing-machine-le...

dragonwriter
> Personally, I think we are on the cusp of a transition from VPS (xen/hvm) to VPS (containers).

There may be some of that, but I think more common will be continuing to have traditional IaaS bring-your-OS-image services, with a new tier in between IaaS and PaaS (call it CaaS -- container host as a service), plus a common use of IaaS being to deploy an OS that works largely as a container host (something like CoreOS).

rsync
"Personally, I think we are on the cusp of a transition from VPS (xen/hvm) to VPS (containers)."

A transition back to, I think ... the very first VPS provider (JohnCompanies, 2001) [1] was based entirely on FreeBSD jails and was entirely container-based, even though those containers were presented as standalone FreeBSD systems.

[1] Yes, Verio did have that odd VPS-like enterprise service earlier than that, but JC did the first VPS as we came to know it.

Touche
Pardon my ignorance, but don't containers share the same kernel as the host? Meaning I can't run an Ubuntu container in a BSD jail or vice versa? I don't want to use containers if it limits my OS choice to being from the same family as the host.
Yeah, cgroups are pretty awesome! I created a screencast about them @ http://sysadmincasts.com/episodes/14-introduction-to-linux-c... if you want to learn more about them and have never used them before.

There is also a great talk by John Wilkes (Google Cluster Management, Mountain View) about job scheduling via Omega at Google Faculty Summit 2011 @ http://www.youtube.com/watch?v=0ZFMlO98Jkc

patrickxb
Great screencast....thanks!
nisa
Thanks! Much appreciated!
nilsimsa
Nice website. Will be viewing the rest of your screencasts...
WestCoastJustin
Thanks! If you have any suggestions or episode ideas, please let me know!
There is a great Wired article [1], which outlines how Google uses orchestration software to manage Linux containers in its day-to-day operations. Midway through the article there is a great diagram showing Omega (Google's orchestration software) and how it deploys containers for Images, Search, Gmail, etc. onto the same physical hardware. There is an amazing talk by John Wilkes (Google Cluster Management, Mountain View) about Omega at Google Faculty Summit 2011 [2]; I would highly recommend watching it!

By the way, one of the key concepts behind containers is control groups (cgroups) [3, 4], which were initially added to the kernel back in 2007 by two Google engineers, so they have definitely given back in this area. I know all this because I have spent the last two weeks researching control groups for an upcoming screencast.

I am happy Google released this, and cannot wait to dig through it!

[1] http://www.wired.com/wiredenterprise/2013/03/google-borg-twi...

[2] http://www.youtube.com/watch?v=0ZFMlO98Jkc

[3] http://en.wikipedia.org/wiki/Cgroups

[4] https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt

stormbrew
Am I the only one who feels like cgroups are extraordinarily complex for the problem they're trying to solve? It seems like a simpler structure could have achieved most of the same goals and not required one or two layers (in the case of docker cgroup->lxc->docker) of abstraction to find widespread use.

In particular, was the ability to migrate a process or have a process in two cgroups really essential to containerization? It seems like without those it'd be a simple matter of nice/setuidgid-style privilege de-escalation commands to get the same kinds of behaviour, without adding a whole other resource-management mechanism (the named groups) to the mix.

The cgroups document you link to as [4] has such a weirdly contrived use case example it makes me think they were trying really hard to come up with a way to justify the complexity they baked into the idea.

shykes
The good news is that namespaces (the most interesting part of containers) are simpler than cgroups, and the API is stable.

cgroups are indeed a mess. The API is highly unstable and there is an effort underway to sanitize it, with the help of a "facade" userland API. In other words kernel devs are basically saying: "use this userland API while we fix our shit". (I don't claim to understand the intricacies of this problem. All I know is that, as a developer of Docker, it is better for my sanity to maintain an indirection between my tool and the kernel -- until things settle down, at least.)
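
As a small illustration of how thin the namespace API is compared to cgroups, the sketch below gives the current process its own UTS (hostname) namespace with a single unshare() call; it goes through ctypes because older Python versions have no built-in binding, and it needs CAP_SYS_ADMIN to run.

  # Tiny illustration of how small the namespace API surface is: a single
  # unshare() call gives this process its own UTS (hostname) namespace.
  # Requires CAP_SYS_ADMIN; called via ctypes since older Pythons have no
  # built-in binding for unshare().
  import ctypes
  import socket

  CLONE_NEWUTS = 0x04000000                 # from <linux/sched.h>

  libc = ctypes.CDLL("libc.so.6", use_errno=True)
  if libc.unshare(CLONE_NEWUTS) != 0:
      raise OSError(ctypes.get_errno(), "unshare failed (are you root?)")

  socket.sethostname("container-demo")      # only visible inside this namespace
  print("hostname in new UTS namespace:", socket.gethostname())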

WestCoastJustin
cgroups are extremely powerful, but they are fairly complex; it took me some hands-on experience to wrap my mind around them. Red Hat has done a great job on the integration side. You can watch a demo @ http://www.youtube.com/watch?v=KX5QV4LId_c
menage
(Original cgroups developer here, although I've since moved on from Google and don't have time to play an active role anymore.)

It's true that cgroups are a complex system, but they were developed to solve a complex group of problems (packing large numbers of dynamic jobs on servers, with some resources isolated, and some shared between different jobs). I think that pretty much all the features of cgroups come either from real requirements, or from constraints due to the evolution of cgroups from cpusets.

Back when cgroups was being developed, cpusets had fairly recently been accepted into the kernel, and it had a basic process grouping API that was pretty much what cgroups needed. It was much easier politically to get people to accept an evolution of cpusets into cgroups (in a backward-compatible way) than to introduce an entirely new API. With hindsight, this was a mistake, and we should have pushed for a new (binary, non-VFS) API, as having to fit everything into the metaphor of a filesystem (and deal with all the VFS logic) definitely got in the way at times.

If you want to be able to manage/tweak/control the resources allocated to a group after you've created the group, then you need some way of naming that group, whether it be via a filesystem directory or some kind of numerical identifier (like a pid). So I don't think a realistic resource management system can avoid that.

The most common pattern of the need for a process being in multiple cgroups is that of a data-loader/data-server job pair. The data-loader is responsible for periodically loading/maintaining some data set from across the network into (shared) memory, and the data-server is responsible for low-latency serving of queries based on that data. So they both need to be in the same cgroup for memory purposes, since they're sharing the memory occupied by the loaded data. But the CPU requirements of the two are very different - the data-loader is very much a background/batch task, and shouldn't be able to steal CPU from either the data-server or from any other latency-sensitive job on the same machine. So for CPU purposes, they need to be in separate cgroups. That (and other more complex scenarios) is what drives the requirement for multiple independent hierarchies of cgroups.

Since the data-loader and data-server can be stopped/updated/started independently, you need to be able to launch a new process into an existing cgroup. It's true that the need to be able to move a process into a different cgroup would be much reduced if there was an extension to clone() to allow you to create a child directly in a different set of cgroups, but cpusets already provided the movement feature, and extending clone in an intrusive way like that would have raised a lot of resistance, I think.
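
A rough sketch of that loader/server pattern under cgroup v1, where cpu and memory really are independent hierarchies: both processes share one memory cgroup (they share the loaded data) but sit in different cpu cgroups. The PIDs and group names are placeholders.

  # Sketch of the loader/server pattern described above, under cgroup v1
  # where cpu and memory are independent hierarchies.  Both processes share
  # one memory cgroup, but sit in different cpu cgroups so the loader can't
  # steal CPU from the server.  PIDs and group names are placeholders.
  import os

  LOADER_PID, SERVER_PID = 1234, 1235

  def add_to_cgroup(controller, group, pid):
      path = f"/sys/fs/cgroup/{controller}/{group}"
      os.makedirs(path, exist_ok=True)
      with open(os.path.join(path, "tasks"), "w") as f:
          f.write(str(pid))

  # One shared memory cgroup for the pair.
  add_to_cgroup("memory", "dataset", LOADER_PID)
  add_to_cgroup("memory", "dataset", SERVER_PID)

  # Separate cpu cgroups: background weight for the loader, high for the server.
  add_to_cgroup("cpu", "dataset/loader", LOADER_PID)
  add_to_cgroup("cpu", "dataset/server", SERVER_PID)

  with open("/sys/fs/cgroup/cpu/dataset/loader/cpu.shares", "w") as f:
      f.write("2")        # effectively batch priority
  with open("/sys/fs/cgroup/cpu/dataset/server/cpu.shares", "w") as f:
      f.write("4096")     # well above the default of 1024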

stormbrew
Cool, thanks for the details.
latchkey
There is also warden, but it doesn't get much press.

https://github.com/cloudfoundry/warden/tree/master/warden

jambay
Yes, Cloud Foundry has been using Warden for PaaS isolation between hosted apps for a while. It was originally authored by Redis and Cloud Foundry contributor Pieter Noordhuis, currently working for VMware [1]. The ongoing work has been continued by the Cloud Foundry team at Pivotal.

Warden has a C server core [2] wrapping cgroups and other features currently on the lmctfy roadmap, like network and filesystem isolation [3]. The current filesystem isolation uses either aufs or overlayfs depending on the distro/Linux version you are using [4]. The network uses namespaces and additional features.

Warden also has early/experimental support for CentOS in addition to Ubuntu, although some of the capabilities are degraded. For example, disk isolation uses a less efficient, but still workable, copy-filesystem approach.

The client orchestration of Warden is currently written in Ruby, but there was also a branch started to move that to Go [5] that has not been hardened and moved into master.

Recently Cloud Foundry started using bosh-lite [2] leveraging Warden to do full dev environments using Linux containers instead of separate Linux hosts on many virtual machines from an IaaS provider, which has dramatically reduced the resources and time required to create, develop and use the full system.

[1] https://twitter.com/pnoordhuis [2] https://github.com/cloudfoundry/warden/tree/master/warden/sr... [3] https://github.com/cloudfoundry/warden/blob/master/warden/RE... [4] https://github.com/cloudfoundry/warden/blob/master/warden/RE... [5] https://github.com/cloudfoundry/warden/tree/go

WestCoastJustin
Just a follow up to my comment. I have released the "Introduction to Linux Control Groups (cgroups)" screencast which I talked about in my previous comment. View it @ http://sysadmincasts.com/episodes/14-introduction-to-linux-c...
caniszczyk
You may want to check out Mesos: http://mesos.apache.org/
SEJeff
My understanding (from reading various articles) is that Mesos is very analogous to Google's original scheduler, Borg.
shykes
I just spent the last hour or so digging through the code and playing with the CLI. It's pretty neat. The "specification" for creating new containers (how to describe limits, namespacing, etc.) is not very well documented, so it takes some trial and error... But this feels like a nice, clean low-level component.

It can be used as a C++ library, too - I'm going to evaluate it as a possible low-level execution backend for docker :)

shykes
In response to the other comments mentioning docker - this definitely does not compete with docker. A better comparison would be with the lxc tools or maybe something like libvirt-lxc or systemd's nspawn.
alinspired
Google is probably holding back the piece that competes with Docker or OpenVZ; they likely have something like Docker internally.
shykes
I'm sure they do -- but then again, Google is probably holding a proprietary alternative to virtually every piece of software in the world :) Most of it is too tied to "the Google way" to be useful to anybody else.
thockin
This is more true than the idea that there's a competitive advantage. We would love to open-source more of our stack (and are working towards it), but it's all very tied to the rest of our cluster environment. Piece by piece.
azinman2
As an ex-Googler I agree. Much of it just wouldn't be useful as-is to most people, as it's an environment that few match in load, resources, and diversity of services. Plus it's all integrated tightly across the board.
shykes
Quick summary of my early explorations:

* Building is relatively straightforward on an Ubuntu system. You'll need to install re2 from source, but that's about it.

* No configuration necessary to start playing. lmctfy just straight up mounts cgroups and starts playing in there.

* Containers can be nested which is nice.

* I really couldn't figure out useful values for the container spec. Even the source doesn't seem to have a single reference - it's all dynamically registered by various subsystems in a rather opaque way. I opened a few issues to ask for more details.

* This is a really low-level tool. Other than manipulating cgroups it doesn't seem to do much, which is perfect for my particular use case (docker integration). I couldn't figure out how to set namespaces (including mnt and net namespaces which means all my containers shared the host's filesystem and network interfaces). I don't know if that functionality is already in the code, or has yet to be added.

* Given the fairly small footprint, limited feature set, and clean build experience, this really looks like an interesting option to use as a backend for docker. I like that it carries very few dependencies. Let's see what the verdict is on these missing features.

menage
Most apps running on Google servers are aware that they're running in a shared environment, so they don't need the overhead of virtualized network interfaces. So I doubt that there will be any specific support for network namespaces.

And you can approximate mount namespaces with chroots and bind mounts. (In some ways that's better, since it's a bit easier for a process outside the container to interact with the container's filesystem).

jnagal
We (lmctfy team) are in the middle of designing and adding namespace support. It will start trickling in soon.
shykes
Damn. This means it's much less useful to me (and 99% of applications outside of google). I guess I could combine lmctfy with a namespacing library of my own. But that's more extra work than I was anticipating.
SEJeff
Perhaps they'd be open to a collaboration where you add that functionality and then you use it to make docker (even more!) awesome.
justincormack
The namespacing part is much simpler, if you have specific use cases.
thockin
Namespaces will be coming, but we're not there yet. This captures some of what we are already doing internally, but not all of it, yet.
jamesaguilar
> useful values for the container spec

Are you referring to the container spec in the proto file? https://github.com/google/lmctfy/blob/master/include/lmctfy.... Which attributes are you having trouble setting a useful value for?

shykes
Thank you! I was scanning .h files for a type declaration.. Silly old-school me :)
jamesaguilar
I only knew because I've worked with the internal analogue of this library recently. Glad I could help.
nickstinemates
According to the docs all that can be set is cpu and memory limits. So maybe that's the extent of it for now, even though the proto identifies more.
No. Docker just supports provisioning containers. Take a high-level view of a data centre with 10,000 machines: you need orchestration software that knows about and automates these 10k machines -- which ones are online/offline, utilization, storage, networking, current state, power distribution (what happens if a rack's power drops out?), etc. -- and then provisions containers onto that hardware. Think of this orchestration software as a brain: it knows about the current state of things, keeps a watch on what is happening in the data centre, fixes things when they go bad, and knows where to put things when you want to do something.

  Docker --> hardware
  Orchestration software --> Docker/LXC/VM/etc --> hardware
An additional example would be dotCloud (makers of Docker): they have orchestration software sitting atop Docker, which knows about users, machines, etc. and then provisions Docker instances on AWS hardware.

   dotCloud (orchestration software) --> Docker --> AWS EC2 (hardware)
There is a great Wired article (linked to below [1] and in the OP's story), which outlines how Google uses this orchestration software in its day-to-day operations. There is a great diagram [2] showing Omega (Google's orchestration software) and how it deploys containers for Images, Search, Gmail, etc. onto the same physical hardware. There is an amazing talk by John Wilkes (Google Cluster Management, Mountain View) about Omega at Google Faculty Summit 2011 [3]; I would highly recommend watching it!

ps. Orchestration software has a global view of resources across all machines, so it knows how to get the best utilization out of all these machines (think jigsaw puzzle). You submit a container profile to the orchestration software -- things like instance lifetime, CPU, storage, memory, redundancy -- and the software will figure out where to place your instances (a rough, hypothetical sketch of such a profile follows the links below).

[1] http://www.wired.com/wiredenterprise/2013/03/google-borg-twi...

[2] http://www.wired.com/wiredenterprise/wp-content/uploads/2013...

[3] http://www.youtube.com/watch?v=0ZFMlO98Jkc
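
To make the "container profile" idea concrete, here is a purely hypothetical sketch; the field names are invented for illustration and do not correspond to Borg's or Omega's actual configuration format.

  # Purely illustrative: what a "container profile" handed to an orchestrator
  # might contain.  The field names here are invented for this sketch and do
  # not reflect Borg's or Omega's real configuration language.
  from dataclasses import dataclass

  @dataclass
  class ContainerProfile:
      name: str
      image: str             # what to run
      cpus: float            # cores per instance
      memory_mb: int         # RAM per instance
      disk_gb: int           # local scratch storage
      instances: int         # desired number of copies
      lifetime: str          # e.g. "batch" vs "service"
      max_per_rack: int      # spread instances for redundancy

  profile = ContainerProfile(
      name="thumbnailer",
      image="example/thumbnailer:7",
      cpus=0.25,
      memory_mb=512,
      disk_gb=2,
      instances=200,
      lifetime="service",
      max_per_rack=4,
  )
  # The scheduler's job is then a bin-packing problem: place 200 instances
  # across the fleet without violating any machine's capacity or the
  # per-rack spread constraint.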

electic
Thanks! That clears things up. I appreciate that.

P.S. - Not to Justin but others. Funny, you ask an honest question on here and people downvote you... sigh.

wmf
I think Docker is great but it's been so overexposed on HN that there's a bit of a backlash.
May 12, 2013 · 9 points, 0 comments · submitted by dsl
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.