HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
CppCon 2015: Fedor Pikus PART 1 “Live Lock-Free or Deadlock (Practical Lock-free Programming)"

CppCon · Youtube · 39 HN points · 3 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention CppCon's video "CppCon 2015: Fedor Pikus PART 1 “Live Lock-Free or Deadlock (Practical Lock-free Programming)"".
Youtube Summary
http://www.Cppcon.org

Presentation Slides, PDFs, Source Code and other presenter materials are available at: https://github.com/cppcon/cppcon2015

Part I: Introduction to lock-free programming. We will cover the fundamentals of lock-free vs lock-based programming, and explore the reasons to write lock-free programs as well as the reasons not to. We will learn, or be reminded of, the basic tools of lock-free programming and consider a few simple examples. To make sure you stay on for Part II, we will try something beyond the simple examples, for example a lock-free list, just to see how insanely complex the problems can get.

Part II: Having been burned on the complexities of generic lock-free algorithms in Part I, we take a more practical approach: assuming we are not all writing STL, what limitations can we really live with? It turns out that there are some inherent limitations imposed by the nature of the concurrent problem: is there really such a thing as a "concurrent queue" (yes, sort of), and we can take advantage of these limitations (what an idea, concurrency actually makes something easier!). Then there are practical limitations that most application programmers can accept: is there really such a thing as a "lock-free queue" (maybe, and you don't need it). We will explore practical examples of (mostly) lock-free data structures, with actual implementations and performance measurements. Even if the specific limitations and simplifying assumptions used in this talk do not apply to your problem, the main idea to take away is how to find such assumptions and take advantage of them, because, chances are, you can use lock-free techniques and write code that works for you and is much simpler than what you learned before.

Fedor G Pikus is a Chief Engineering Scientist in the Design to Silicon division of Mentor Graphics Corp. His earlier positions included Senior Software Engineer at Google, and Chief Software Architect for Calibre PERC, LVS, DFM at Mentor Graphics. He joined Mentor Graphics in 1998 when he switched from academic research in computational physics to the software industry. His responsibilities as a Chief Scientist include planning the long-term technical direction of Calibre products, directing and training the engineers who work on these products, design and architecture of the software, and research in new design and software technologies. Fedor has over 25 patents and over 90 papers and conference presentations on physics, EDA, software design, and the C++ language.

Videos Filmed & Edited by Bash Films: http://www.BashFilms.com
HN Theater Rankings

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
Sep 22, 2022 · chrsig on Lockfree Algorithms (2010)
Fedor Pikus has given a number of great talks on the subject at cppcon over the years

some links, in no particular order:

CppCon 2017: Fedor Pikus “C++ atomics, from basic to advanced. What do they really do?

https://www.youtube.com/watch?v=ZQFzMfHIxng

CppCon 2016: Fedor Pikus “The speed of concurrency (is lock-free faster?)"

https://www.youtube.com/watch?v=9hJkWwHDDxs

CppCon 2015: Fedor Pikus PART 1 “Live Lock-Free or Deadlock (Practical Lock-free Programming)"

https://www.youtube.com/watch?v=lVBvHbJsg5Y

CppCon 2015: Fedor Pikus PART 2 “Live Lock-Free or Deadlock (Practical Lock-free Programming)”

https://www.youtube.com/watch?v=1obZeHnAwz4

CppCon 2017: Fedor Pikus “Read, Copy, Update, then what? RCU for non-kernel programmers”

https://www.youtube.com/watch?v=rxQ5K9lo034

Tony Van Eerd has also given some talks on lock-free queues; I'll leave it as an exercise for the reader to search YouTube :)

Yet another article discussing atomics without discussing the memory model or memory fences.

Atomics are the easy part to understand. Memory fences and memory ordering are the hard part that needs far more discussion. If you use atomics with the wrong memory fence, everything will break.

Memory fences are necessary to make the compiler, CPU, and caches put the data into main-memory in the correct order.

----------

Do NOT write Atomics unless you understand fences. It is absolutely essential, even on x86, to put memory fences in the correct spot. Even though x86 automatically has acquire/release consistency... the compiler may move your data ordering to the wrong location. Even with "volatile" semantics, your data may be committed to memory in the wrong order.

Fedor Pikus has an amusing example where you can have a properly coded lock but the compiler (may) mess you up: https://youtu.be/lVBvHbJsg5Y?t=811

----------

EDIT: if you use locks (spinlocks, mutex, condition variables, etc. etc.), the author of the library would have put memory fences in the correct place for you. Only if you use synchronization "manually" (and... you probably are doing that if you are writing atomics), do you have to think about memory fences.

stcredzero
> Only if you use synchronization "manually" (and... you probably are doing that if you are writing atomics), do you have to think about memory fences.

I was thinking about adding atomic pointer references to a partially persistent (1) AABB tree, then allowing multiple actors in multiple threads each implementing a gameloop to query the AABB tree. Do I need to think about memory fences in that application?

I've also been thinking about implementing the above system without atomic pointers. Instead, the system would have an update phase where all writes to the AABB tree happen at once, then all of the actors are allowed to run concurrently for one game loop tick.

In both cases, I'm thinking of placing the AABB tree into shared memory, but my brief experience with using shared memory makes me wary of performance problems.

(1) Persistent in this sense, of course: https://en.wikipedia.org/wiki/Persistent_data_structure

dragontamer
> I was thinking about adding atomic pointer references to a partially persistent (1) AABB tree, then allowing multiple actors in multiple threads each implementing a gameloop to query the AABB tree. Do I need to think about memory fences in that application?

Probably?

Let me give an attempt at what a memory-fence actually is, and why I think you may need to think about memory fences.

Atomics solve one PART of the synchronization problem. The second problem, which is far more complex, is that the compiler, CPU, and Memory / Cache system reorders your reads-and-writes in a non-obvious way. These reorderings can lead to obscure bugs.

These reorderings happen because of single-threaded optimization. Programmers expect the compiler, CPU, and L1 cache to reorder statements to go as fast as possible. Ever wonder what the -O2 and -O3 compiler flags do? They'll eliminate variables and change your code in subtle ways.

Even if you run at -O0 optimizations, these conceptual reorderings exist at the CPU (store buffer) and L1 cache levels. So you cannot escape this unfortunate, complex reality.

Therefore: it is the programmer's responsibility to define a "Happens-before" relationship through memory fences / memory barriers. A "happens-before" relationship will kill optimizations (be it a compiler-optimization, CPU store-buffer optimization, or L1 cache optimization) in just the right way to make your code correct.

------------

With that out of the way, let's go into your AABB tree question.

Assume "node" is a node of a binary AABB tree, with "node->left" and "node->right" referring to its two children. Let's assume both "left" and "right" are atomic<> pointers.

Now think of the code:

    AABB_Node* tmp = (AABB_Node*)malloc(sizeof(AABB_Node));
    node->left.store(tmp); // This is a race condition!
    tmp->initialize();
The above code is wrong. You can't "publish" the data before it is initialized! "tmp" would contain garbage data (it's fresh from the malloc). Therefore, you need to write the following code:

    AABB_Node* tmp = (AABB_Node*)malloc(sizeof(AABB_Node));
    tmp->initialize();
    node->left.store(tmp);
Okay, now you "think" you've solved the multithreaded issue, but you have NOT. Because the compiler says "tmp is an unnecessary variable" and optimizes it away, it may decide to emit:

    node->left.store((AABB_Node*)malloc(sizeof(AABB_Node)));
    node->left.load()->initialize(); // compiler wants variable-elimination optimization
Uh oh, the compiler "moved" the code around, which is legal in a single-threaded setting. But this reordering destroys your multithreaded logic: your other threads may see the malloc()'ed pointer before the initialize.

------------

"Happens-before" is the key concept. What we want to say is:

    AABB_Node* tmp = (AABB_Node*)malloc(sizeof(AABB_Node));
    tmp->initialize();

    __happens before__

    node->left.store(tmp);
Good news: modern C++11 allows you to specify a "happens-before" relationship through memory fences.

    node->left.store(tmp, std::memory_order_release);
"memory_order_release" is the minimum happens-before relationship we want here. The C++11 standard states: "A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store."

The compiler will emit the low-level code needed to ensure the happens-before relationship holds, through lower-level memory fence mechanisms. GPUs will have to flush their L1 caches (due to incoherent memory), while CPUs only have to wait for the store buffer to drain and MESI messages to be passed around (ARM has the "dmb ish" instruction, for example).

Regardless of the low-level details, the job of the programmer with memory fences remains the same. Specifying important "happens-before" relationships.

-------------

In the GPU world, where C++11 atomics do not exist yet, the happens-before relationship is created with a threadfence instruction. In CUDA, you would do:

    AABB_Node tmp = malloc();
    tmp.initialize();

    __threadfence_block(); // Block-level synchronization

    node->left = tmp;
__threadfence_block() is a fence that only works within a CUDA thread block (and is therefore far faster than __threadfence()). Effectively, you only wait for the L1 cache (while __threadfence() waits for an L2 cache commit).

-------------

I'll butcher the explanations a bit (real code doesn't work like this, but it may help you to understand things). EDIT: Oh gosh, I got these backwards again in my first draft. I'm pretty sure I put them in the correct order now.

An "acquire-read" operation basically compiles from:

    foo();
    node->left.load(memory_order_acquire);
    bar();
Becomes (approximately)

    foo();
    register = node->left
    __happens before__ 
    bar();
Different CPUs have different memory guarantees. x86 will have __happens before__ be a no-op in this case.

A "release-store" operation compiles from:

    foo()
    node->left.store(register, memory_order_release);
    bar();
Becomes (approximately)

    foo();
    __happens before__ 
    node->left = register;
    bar();
The "strongest" operation is memory_order_seq_cst, which approximately translates into:

    foo();
    __happens before__
    node->left = register;
    __happens before__ 
    bar();
But this kind of heavy-ordering is the slowest to implement in practice. Hopefully that explains a few things.
stcredzero
Thanks. That was indeed helpful. A part of the problem is that I'm most familiar with barriers used in GC algorithms, and in sandboxing. So seeing the term "memory fence" was confusing. (I should've just read the Wikipedia page!)
rrss
> Topics like sequential consistency and memory barriers are critical pieces of the puzzle and can't be overlooked if you want to get the best out of your lock-free algorithms. I will cover them all in the next episode.

Not every article has to be an exhaustive description of everything you need to know.

Anywho, sequential consistency is the default for C++11 and C11 atomics, so you can pretty much use them without a complete understanding of memory consistency models.

dragontamer
> Not every article has to be an exhaustive description of everything you need to know.

But a warning would be nice. You don't want some poor beginner trying things out and getting these very complicated details inevitably wrong.

> Anywho, sequential consistency is the default for C++11 and C11 atomics, so you pretty much can use them without a complete understanding of memory consistency models.

C11 and C++ atomics aren't widely implemented yet in the GPU world. OpenCL 1.2 doesn't have them (and OpenCL 2.0 is not commonly implemented), CUDA doesn't have them yet, AMD ROCm doesn't have them yet.

There are lots of older libraries and compilers that use old-school memory fence style intrinsics as well. Not everyone is using the latest GCC / Clang / Visual Studio versions on the latest hardware. Work with any compiler from 10 years ago, and you will have atomics without sequential consistency.

-------

Just a simple disclaimer, such as: Java, C#, and C++11 atomics operate under sequential consistency by default. Other memory models exist especially for older compilers and GPU programmers. Under these other memory models, you need to understand memory fences to get the correct behavior from the compiler, the CPU, and the caches.

Maybe have a line or two about "memory fences are outside the scope of this blog post". It's one of those things beginners should be aware of, if only to "scare" them into using C++11 atomics (instead of whatever other implementation may be out there).

dragontamer
Ah, you're right. They call them "barriers" instead of "fences", so I didn't pick up on that paragraph.

Still, C11 and C++ atomics aren't widely implemented yet in the GPU world. OpenCL 1.2 doesn't have them (and OpenCL 2.0 is not commonly implemented), CUDA doesn't have them yet, AMD ROCm doesn't have them yet.

And anyone with older compilers will probably be working with fence intrinsics, instead of "innate" barriers per atomic instruction.

Jun 10, 2019 · dragontamer on Memory Barriers (2003)
* https://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-20...

* https://www.youtube.com/watch?v=lVBvHbJsg5Y

EDIT: Honorable mention is Preshing's blog. https://preshing.com/20131125/acquire-and-release-fences-don...

There's so much there it's hard to pick out anything in particular in Preshing's blog as a "must read". It's all very useful, but it's also spread out over a lot of posts. I just picked the first link from search engines.

EDIT2: Honorable mention#2: The OpenCL 2.0 standard section 3.3.6: https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf

The C++11 draft standard defines "happens before" and all that good stuff, but it is a 1000+ page document. OpenCL 2.0 copy/pastes from the C11 / C++11 standards and puts it all in one section: 3.3.6, which is only about 3ish pages long (+additional pages for OpenCL specific stuff).

OpenCL 2.0 is missing consume semantics, but nobody has implemented that properly yet anyway (even in the C++ world, no compiler has bothered to implement memory_order_consume as anything stronger than plain acquire).

I guess OpenCL2.0 has the whole "local vs global" memory order thing going on, while C11 / C++11 is all one memory space. But I personally found OpenCL2.0's specification to be easier to read than the C++11 draft standard.

Jun 04, 2019 · 38 points, 4 comments · submitted by dragontamer
mehrdadn
My favorite talk on this is atomic<> Weapons by Herb Sutter... it goes a bit deeper I think:

https://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-20...

https://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-20...

dragontamer
That is also an excellent talk. Hard to say which one is "better".

Fortunately, the material is difficult enough that you'll probably want to watch both talks if you're studying this subject :-)

ameyv
That Java Updater crash though, couldn't be more epic :)
dragontamer
Part 2 available here: https://www.youtube.com/watch?v=1obZeHnAwz4

One of the absolute best presentations on lock-free programming I've ever seen. Although this is a CppCon talk, the overall talk is high level and should apply to any programming language with atomics and memory fences / memory barriers.

Atomics, memory fences, and lock-free multithreaded programming are among the most difficult and convoluted subjects in all of modern programming. Pikus breaks the subject down excellently, with many practical examples.

Dec 27, 2017 · 1 points, 1 comments · submitted by dragontamer
dragontamer
I watched this talk (as well as "Part 2": https://www.youtube.com/watch?v=1obZeHnAwz4), and it was great. One of the best introductions to lock-free programming that I've found.

Together with "Part 2", this talk mostly covers "Acquire" and "Release" semantics with regards to C++ Atomics. But the discussion points are sufficiently general that I'm certain any programmer will find it relevant.

Lock-free multithreaded programming is a mysterious art that requires good knowledge of memory barriers and the like. It's rare to find a video, such as this one, that so clearly goes over the concepts.

HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.