HN Theater @HNTheaterMonth

The best talks and videos of Hacker News.

Hacker News Comments on
Larry Hastings - Removing Python's GIL: The Gilectomy - PyCon 2016

PyCon 2016 · YouTube · 179 HN points · 13 HN comments
HN Theater has aggregated all Hacker News stories and comments that mention PyCon 2016's video "Larry Hastings - Removing Python's GIL: The Gilectomy - PyCon 2016".
YouTube Summary
Speaker: Larry Hastings

You've heard of Python's "GIL"... but what is it? What problems does it solve? How does it work? What are the ramifications of its design? Attendees should have a basic knowledge of multithreaded programming. Understanding C is helpful but not required.

Slides can be found at: https://speakerdeck.com/pycon2016 and https://github.com/PyCon/2016-slides
HN Theater Rankings

Hacker News Stories and Comments

All the comments and stories posted to Hacker News that reference this video.
In addition to the points other people raised, "simply" removing the GIL by changing the current reference-counting GC to do atomic reference updates will also substantially reduce single threaded performance; there was a PyCon talk about that.

https://www.youtube.com/watch?v=P3AyI_u66Bw

See also Swift for another language whose implementation has performance issues caused by ubiquitous atomic reference counting.

Oct 02, 2018 · zbentley on Questions
https://youtu.be/P3AyI_u66Bw

Or search “gilectomy” on LWN. Or check out Stackless etc.; it turns out that technically removing the GIL has historically been one of the easier parts of eliminating it entirely.

It sounds like your main problem is with Python's GC model. Reference counting doesn't have to be thread-unsafe, but in scripting/interpreted-ish languages that want their threading story to involve seamless sharing of state between threads at will (i.e. no special syntax unless you want it; it's on you not to blow your foot off), like Python and Ruby, a GIL or equivalent is the norm. Sometimes it's not as intrusive as Python's, but it does seem like a necessary (or at least very likely) implementation pattern for languages that want to provide that seamlessness. You can have thread-safe reference-counted GC in a traditional scripting language, but that tends to come with a much less automatic threading/concurrency API. Perl 5 is an example of that category, and it is implemented in C.
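As an illustrative sketch of that "seamless sharing" (plain Python, names made up): any thread can touch a module-level variable with no special syntax. Note the GIL only makes individual bytecode operations atomic; `counter += 1` compiles to several bytecodes, so a compound update still needs a lock:

```python
import threading

counter = 0
lock = threading.Lock()

def bump(n):
    global counter
    for _ in range(n):
        # The GIL does not make this read-modify-write atomic;
        # without the lock, increments could be lost.
        with lock:
            counter += 1

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert counter == 40_000
```

The language-level ergonomics (shared state by default) are exactly what makes a GIL-like mechanism such a convenient implementation strategy.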

lisper
> a GIL or equivalent is the norm

Exactly. And why do you think that is? And in particular, why do you think it is the norm when it is decidedly NOT the norm for Common Lisp and Scheme? The norm for those languages is completely seamless native threading with no special syntax.

zbentley
It seems like you're arguing in favor of a different language rather than a different implementation of Python. If you wanted to implement, say, the Python class/'object' type in a thread-safe way that didn't rely on a GIL and still freed memory according to approximately the same GC promises as the Python interpreter, I suspect you'd end up implementing something GIL-like in Scheme or Lisp (though my experience extends only to some Clojure and Racket in school, so I may be unaware of the capabilities of some other dialects).

If you wanted to implement a language that looked a little like Python but had your favored Lisp's GC semantics and data structures, I'm sure you could. But it wouldn't be Python.

That's without getting into the significant speed tradeoffs--you can make these languages fast, and I get the impression that there has been a ton of progress there in the last decade. But when Python was being created, and when it had to deal with the implications of concurrency in its chosen semantics? Not so. As I originally said: was it theoretically possible to build Python or equivalent on top of a non-C platform at the time? Sure. But I doubt that would have saved it from tradeoffs at least as severe as the GIL, and it definitely would not have been the pragmatic choice--"let's build it on $lisp_dialect and then spend time optimizing that dialect's runtime to be fast enough for us to put an interpreted scripting language on top of it" seems an unlikely strategy.

lisper
> If you wanted to implement, say, the Python class/'object' type in a thread-safe way that didn't rely on a GIL and still freed memory according to approximately the same GC promises as the Python interpreter, I suspect you'd end up implementing something GIL-like in Scheme or Lisp

Nope. CLOS+MOP, which exists natively in nearly all CL implementations, is a superset of Python's object/class functionality.

I feel like the GIL is, at this point, Python's most infamous attribute. For a long time I thought it was also the biggest flaw with Python...but over time I care less and less about it.

I think the first thing to realize is that single-threaded performance is often significantly better with the GIL than without it. I think Larry Hastings' first Gilectomy talk was extremely insightful (about the GIL in general and about performance when removing the GIL):

https://youtu.be/P3AyI_u66Bw?t=23m52s

I am not sure I would, personally, trade single-threaded performance for enabling multi-threaded applications. I view Python as a high-level rapid prototyping language that is well suited for business logic and glue code. And for that type of workload I would value single-threaded performance over support for multi-threading.

Even now, a year later, the Gilectomy project is still slightly off performance-wise (although it looks really really close :) ):

https://youtu.be/pLqv11ScGsQ?t=27m32s

As noted elsewhere, multi-processing offers adequate parallelization for this type of logic. Also, coroutines and async libraries such as gevent and asyncio offer easily approachable event loops for maximizing single-threaded resource utilization.

It's true that multi-processing is not a replacement for multi-threading. There definitely are tasks and workloads where multi-processing and its inherent overhead make it unsuitable as a solution. But for those tasks, I question whether or not Python itself (as an interpreted, dynamically typed language) is suitable.

But that's just my $0.02. If there is a way to remove the GIL without negatively impacting single-threaded performance or sacrificing reference counting for a more robust (and heavy) GC, then I am all for it. But if there is not...I would just as soon keep the GIL.

sametmax
Which is why multiple interpreters are a good solution. You keep the GIL and its benefits, but you lose the cost of serialization and can share memory.
VectorLock
The GIL has been a much bigger problem for perception than it ever has been for performance. Python has lost more mindshare over it than anything else. The few machine cycles that were ever saved by moving away from it were far outweighed by the waste of human cycles.
bsder
The few machine cycles that were ever saved by NOT moving away from it (which is the ONLY justification for keeping it) were far outweighed by the waste of human cycles.

If Python would simply suck it up and eat the 20% performance hit, we could stop talking about the GIL and start optimizing code to get the 20% back.

xenadu02
Many projects have solved this problem with dual compilation modes and provide two binaries the user can select from at runtime.

Eliminating the GIL doesn't have to mean actually eliminating it. You could certainly have #defines and/or alternate implementations that make the fine-grained locks no-ops when compiling in GIL mode. Conversely make the GIL a no-op in multithreaded mode.

Larry Hastings - Removing Python's GIL: The Gilectomy - PyCon 2016 https://youtu.be/P3AyI_u66Bw

(I'm pretty sure this is the video I'm thinking of) It's 30m, but worth it if you're interested. Not sure what progress has been made since then.

Just to add more context for others...

C extensions called from Python are still subject to the GIL of the calling interpreter[1].

In fact, this guaranteed thread safety has led to the creation of many non-thread-safe Python C extensions, which is now a critical blocker to efforts to remove the GIL[2], although there are a number of other issues as well.

As noted in the StackOverflow answer, you can release the GIL, but there are some tricks to doing so safely[3].

But threads of a C extension library are not necessarily pinned to a single core by the GIL like regular Python bytecode.

Many performance-oriented libraries like numpy[4] forsake multi-threading by default in many cases and rely on parallelization by the user (programmer)[5].

[1] http://stackoverflow.com/questions/651048/concurrency-are-py...

[2] https://youtu.be/P3AyI_u66Bw

[3] https://docs.python.org/3/c-api/init.html#thread-state-and-t...

[4] http://www.numpy.org/

[5] http://stackoverflow.com/questions/16617973/why-isnt-numpy-m...
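A runnable illustration of GIL release from pure Python: `time.sleep`, like many C-level blocking calls, drops the GIL while it waits, so the five sleeps below overlap instead of serializing:

```python
import threading
import time

# Blocking calls implemented in C (time.sleep, file I/O, and some heavy
# operations in extension libraries) release the GIL while they wait or
# compute, so threads still overlap for this kind of work.
def wait():
    time.sleep(0.2)  # the GIL is released for the duration of the sleep

start = time.perf_counter()
threads = [threading.Thread(target=wait) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Five 0.2 s sleeps overlap to roughly 0.2 s rather than serializing to 1 s.
assert elapsed < 0.8
```

The same mechanism is what a C extension uses (via the C API's GIL-release macros) to let its own computation run in parallel with Python bytecode.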

Jan 25, 2017 · xapata on Problems I Have with Python
Python has made some trade-offs that you dislike. You complain about the negative consequences without comparing those against the benefits.

One of the major factors in speed is efficient memory layout. Contrast a Python list with a NumPy array. To achieve speedier loops and vectorized arithmetic [0], the array gives up dynamic typing and dynamic sizing. In most applications, I would gladly give up some compute speed to gain some programming productivity.

I love duck-typing. Formal typing has some impressive examples, but in the projects I've worked on has reduced my productivity. Perhaps because data are so often serialized to simple formats or written to databases that discard the best tools of formal typing. Anecdotal evidence, for sure.

I've never been bothered by the GIL, but I have benefitted from it [1].

[0] https://docs.continuum.io/mkl-optimizations/

[1] https://www.youtube.com/watch?v=P3AyI_u66Bw

stcredzero
One of the major factors in speed is efficient memory layout.

In most applications, I would gladly give up some compute speed to gain some programming productivity.

By Smalltalk standards, Python is pretty profligate. (By '90s C programmer standards, Smalltalk is pretty profligate.) However, Smalltalk still has many of the same high-productivity features as Python. (In fact, the debugging story is far superior.) I suspect, though, that Python is still a far superior environment for the things you use it for.

xapata
I enjoyed coding homework assignments in Smalltalk, but for some reason I never tried using it professionally.
darkf
>Contrast a Python list with a NumPy array. To achieve speedier loops and vectorized arithmetic [0], the array gives up dynamic typing and dynamic sizing. In most applications, I would gladly give up some compute speed to gain some programming productivity.

Except numpy arrays have a much richer interface and can still store dynamic objects (dtype=object). So what's your point?

>I love duck-typing

So do I. Where does this come from? I don't believe I ever considered it a contra.

xapata
If you're storing generic objects rather than numbers in a NumPy array, you're discarding its main benefits. Sure, you've got some extra slicing sugar for selecting columns and subsets, but comprehensions are more readable (and faster!) in many of those situations.

Duck-typing is Python's form of dynamic typing and therefore incurs the speed penalty. If you want the extra speed, you'll need to give up some dynamism. I say this now, but some of the work the core devs are doing to optimize dicts might let us have our cake and eat it too. Until then, it's a choice: flexible or fast, not both.
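A tiny illustration of the duck-typing being discussed (class names are made up): no declared interface, just dynamic attribute lookup at call time, which is exactly the flexibility that costs speed:

```python
# Any object with a `speak` method works; no common base class required.
class Duck:
    def speak(self):
        return "quack"

class Robot:
    def speak(self):
        return "beep"

def chorus(things):
    # No isinstance checks: each `.speak` lookup happens dynamically at
    # call time, so the interpreter cannot assume a fixed memory layout.
    return [t.speak() for t in things]

assert chorus([Duck(), Robot()]) == ["quack", "beep"]
```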

nneonneo
Have you ever tried appending to a NumPy array in a loop? It's a total disaster! And dtype=object arrays are mostly useless; they gain almost none of the benefits of regular NumPy (you may as well run np functions on plain lists) and play poorly with other types. NumPy is great for numerics and structured data - lists are general purpose structures for data manipulation. They are different, have different goals and trade offs, and I don't think it's appropriate to claim that one size should fit all.
Yes. There was an excellent talk at Pycon 2016, entitled the GILectomy: https://www.youtube.com/watch?v=P3AyI_u66Bw
Aug 23, 2016 · gtirloni on Python JITs are coming
Interesting talk to answer part of that question: https://www.youtube.com/watch?v=P3AyI_u66Bw
Jul 07, 2016 · gshulegaard on Why is Python slow
To be fair, the GIL was included (AFAIK) because it was simple to implement. Also, Guido has said (2007) [1] that he would welcome a patch to remove the GIL if...:

> "... I'd welcome a set of patches into Py3k only if the performance for a single-threaded program (and for a multi-threaded but I/O-bound program) does not decrease."

Unfortunately, experiments thus far have not succeeded in meeting these requirements.

There is some work being done by the Gilectomy project to try to meet this bar (as well as some other requirements) [2]. But it is currently grappling with the previously discovered performance issues that come with removing the GIL.

Also at PyCon 2016, Guido himself mentioned the Gilectomy project and its potential consideration (if it works) for Python 3.6+ [3].

[1] http://www.artima.com/weblogs/viewpost.jsp?thread=214235

[2] https://www.youtube.com/watch?v=P3AyI_u66Bw

[3] https://youtu.be/YgtL4S7Hrwo?t=10m59s

Jul 07, 2016 · gshulegaard on Why is Python slow
To be fair, the GIL wasn't included because it was a simple threading model (AFAIK). It was included because it was simple to implement and it was/is fast(er) (than removing it)[1][2].

If the Gilectomy project [2] succeeds, Guido has mentioned he would consider it for Python 3.6+ [3].

[1] http://www.artima.com/weblogs/viewpost.jsp?thread=214235

[2] https://www.youtube.com/watch?v=P3AyI_u66Bw

[3] https://youtu.be/YgtL4S7Hrwo?t=10m59s

Jul 07, 2016 · gshulegaard on Why is Python slow
The BDFL (Benevolent Dictator For Life) Guido van Rossum himself put forth the idea that he would consider a patch removing the GIL in 2007 [1]:

> "... I'd welcome a set of patches into Py3k only if the performance for a single-threaded program (and for a multi-threaded but I/O-bound program) does not decrease."

Unfortunately, experiments thus far have not succeeded in meeting these requirements.

There is some work being done by the Gilectomy project to try to meet this bar (as well as some other requirements) [2]. But it is currently grappling with the previously discovered performance issues that come with removing the GIL.

Also at PyCon 2016, Guido himself mentioned the Gilectomy project and its potential consideration (if it works) for Python 3.6+ [3].

So when you say Python leaders declared the GIL a "non-issue", I think you are oversimplifying the reality of what removing the GIL means and why leaders (like Guido) have been reluctant to invest resources in pursuing it.

[1] http://www.artima.com/weblogs/viewpost.jsp?thread=214235

[2] https://www.youtube.com/watch?v=P3AyI_u66Bw

[3] https://youtu.be/YgtL4S7Hrwo?t=10m59s

Jul 03, 2016 · tedmiston on Why is Python slow
See Larry Hastings' talk "Removing Python's GIL: The Gilectomy" from PyCon 2016.

https://www.youtube.com/watch?v=P3AyI_u66Bw

As well as his earlier talk "Python's Infamous GIL".

https://www.youtube.com/watch?v=4zeHStBowEk

The GIL is a simple design that provides serious benefit to Python (end user) developers.

There are plenty of options for working around this for parallelism, such as the built-in module multiprocessing.

craigyk
I liked the talk, but when he talks about the GIL keeping Python "fast" I have to laugh. It's so slow already, who cares? He is talking about a factor of 1.3 to 2X to remove the GIL in a language that is already 30-100X slower than C/C++.
Jun 05, 2016 · 168 points, 99 comments · submitted by jahan
pizlonator
I have some thoughts:

- Doing better than atomic inc/dec for those reference counts is going to be hard. All of those techniques are still in an "unconfirmed myth" state AFAICT: someone published a paper but nobody has confirmed that the result holds on a broader set of machines, workloads, baselines, etc.

- Great call on userspace locking. Note that you can do this portably. You don't need special OS support. See https://webkit.org/blog/6161/locking-in-webkit/

- Seems like lots of the locking use cases can indeed be made lock-free if you are willing to roll up your sleeves and get dirty. That's what I would do.

- I still bet that the dominant cost is lock contention and he is not analyzing his data correctly. He appears to claim that it can't be locks because the total CPU time is greater than the total length of critical sections and that some analysis tools tell him that there is massive cache thrashing. But that's exactly what happens if you contend on locks too often. Lock contention causes context switches and thread migrations. Both of those things require cache flushes. So the code that runs under contention will report massive cache thrashing because it will have a high probability of being on a cold cache. Programs indeed will run slower under contention than without it, and while his slow-down is extreme, I've seen worse and fixed it by removing some contention. He should find every contended lock and kill it with fire.

- The dip at 4 cores doesn't surprise me. Computers are strange and most "scalability" charts (X axis is CPUs, Y axis is some measure of perf) I've made had weirdly reproducible dips and jerks.

denfromufa
I asked about lock-free data structures from two major open-source libraries, but Larry found major issues with them. He may consider some of these data structures in the future. I opened a bounty for this.

See closed github issues for gilectomy.

pizlonator
Not lock-free data structures. That's usually a fool's errand. You need lock-free hacks.

Exhibit #1: the lock to protect lazy initialization. That just needs a CAS on the slow path and a load-load fence on the fast path. Delete the object you created if you lose the race and try again.

Exhibit #2: you can probably do dirty things to make loading from dicts/lists not require locks even though storing to them does.

denfromufa
How maintainable are these hacks in the long run? Is there firm theoretical basis, if yes, then sources?
pizlonator
The lazy initialization example is a classic algorithm. Usually concurrency hackers rediscover it from first principles. It's easy to maintain because the concurrency protocol it uses is easy to understand. Here's how I'd write it:

    // Classic lock-free lazy init: fast-path load, CAS on the slow path.
    template<typename T, typename... Args>
    T* constructSingleton(Atomic<T*>& singletonRef, Args&&... args)
    {
        for (;;) {
            T* oldValue = singletonRef.load();
            if (oldValue)
                return oldValue;
            T* newValue = new T(std::forward<Args>(args)...);
            if (singletonRef.compareExchangeWeak(nullptr, newValue))
                return newValue;
            delete newValue; // lost the race; retry and return the winner's object
        }
    }
This works so long as the singletons are happy to be deleted. I'm assuming that's true here.
denfromufa
I think this is C++11, while CPython is using C89 and may update to use some C99 features supported by major compilers (GCC, Clang, MSVC).
pizlonator
Then write it in C89. It's not hard. My first lock-free algo was in C89.
hueving
What a nice talk. It's a shame the only question we got to hear him answer from the audience (only time left for 1) was such a lame troll. If you attend conferences, please don't waste everyone's time with nonsense.

This is a python conference talking about a long-known perf issue and the guy just asked why not use another language. It's like going to WWDC and asking why everyone isn't using Android.

soupshield
I almost didn't watch this because it's over my head, but the speaker was enthusiastic and made it fun. It's a shame there wasn't time for questions. I'm now going to watch the original Larry Hastings GIL talk: https://www.youtube.com/watch?v=4zeHStBowEk
airless_bar
I think it's a very valid question. Given that the constraints by the Python BDFL rule out any reasonable solution, why not move on?
sametmax
The problem is that the question is in the wrong context. Somebody is trying very hard to do something despite those constraints. For free. Why would you even try to discourage him from doing so?
airless_bar
To save him time and frustration due to his project going nowhere?
sametmax
Since when has trolling convinced anybody of anything?

This was definitely not the intent of the troll, just a joke. But a badly executed one. It's like going to a political meeting, hearing somebody's plan to improve the economy, and then asking him why he doesn't just move to another country. It makes little sense.

Plus, everybody should consider the guy's résumé, because he is not just a newcomer trying to show off. He is a hell of a good dev.

airless_bar
I don't believe that the question was meant to be a joke or a trolling attempt.

The guy's competence isn't the issue. It's the requirements of what the fix can and cannot do.

_Everything_ is a mess due to the massive technical debt of never improving key parts of the runtime.

I mean what's the point of a runtime where adding threads makes things magnitudes slower than running in a single-thread?

"But that's just the state right now, they will fix this!"

No, they won't. Just as a perspective: People usually fight over a single percent of improvement or less when working on runtime concurrency. Nobody just goes in and fixes _magnitudes_ of performance issues without rethinking what has been done and picking a completely different approach. This is not about making things "faster". This is literally going from doing necessary work to discovering a way of not having to do the work at all anymore. That's just not going to happen. Neither locks nor reference counting have anywhere near this optimization potential given the current language semantics.

I think the whole (C)Python community missed the train more than a decade ago. Things that necessarily had to happen just didn't happen. Many communities and groups of developers "professionalize" over time. This doesn't seem to have happened in Python.

Just an example: C extensions. It has been clear for at least a decade that the existing C interface won't work for threading, "real" garbage collection, etc.

What would have been smart: Providing a better interface 10 years ago, effectively giving C ext devs 10 years of time to migrate their code. This way efforts to remove the GIL today would have to satisfy one fewer constraint that's currently crippling all of the efforts.

Same with a lot of other things ... getting rid of the GIL would have required changes to various parts of the stack over the years (GC, language -> Python 3?, APIs ...), turning the actual removal of the GIL just into a final act of a multi-step process.

What actually happened: Nothing. And now they just try to break the large lock into millions of smaller locks ... in 2016. WAT? This just tells me that key people in the Python community never really gave a damn about the issue and therefore this guy will waste his time.

I have never expected Python's demise, but it now seems that largely its culture and not its lacking technology brought it down for good. From my perspective, they should have never released Python 3 with a GIL. That's what broke Python's neck, finally.

l1ambda
Interesting train of thought. So if I understand correctly, the gist of it is that it starts to become more costly to fix Python than to use another language that already has threading... so why not just use another language.
airless_bar
Costly: yes.

Why not use another language? I think it's a perfectly valid question for that guy to ask.

My personal opinion on "Why not use another language" is a bit different though:

Adopting a different language might make sense – not necessarily for technical reasons but for cultural. The technical issues could have been addressed in time, and the language would be in a much better situation today.

The fact that the technical issues have not been addressed for such a long time shows a clear lack of leadership and focus. I think these cultural issues are a bigger problem than the technical issues. You can fix code, but you can't fix people.

needusername
I have been subjected to more than a decade of Python propaganda about the GIL and how it is a good thing because:

* it prevents deadlocks

* it prevents livelocks

* it means you don't have to lock when you share across threads

and how it's not a big deal because

* you're IO-bound anyway

* you can easily write performance critical parts in C

* you can use multiple processes, in fact that's better design anyway

So I would have asked more critical questions. Feel free to call that trolling if you want.

booop
It seems to be a common mindset among python developers. You often hear the reason you can't have something in Python is because it's theoretically impossible, impractical and too difficult - but then it happens in another language. I was convinced that dynamic languages would always be slow (see https://wiki.python.org/moin/Why%20is%20Python%20slower%20th...) but then Google gave us V8.
kris-s
I hope this is a fruitful endeavor, but I can't help feeling the pressure to "not have another Py2 -> 3 situation" will win out and the GIL will remain. Which, I think, is completely fine.

For me Python is the ultimate glue code - it's the duct tape of my programming world. If I know multi-core performance is going to be an issue up front, I would pick another language.

IndianAstronaut
>If I know multi-core performance is going to be an issue up front, I would pick another language.

This is why I have shifted most of my code (I work in data warehousing) to Go. Python is handy for little scripts and data mining.

smegel
Ideally Golang.

No other language really does it right like Go - unifying event based and traditional multi-threading paradigms in a way that transparently utilizes all the cores on your system, while allowing you to write plain old iterative, blocking code.

Go may be less than ideal in many other regards (i.e. the rest of the language), but it gets this right.

rubiquity
> No other language really does it right like Go

Not trying to start a war but... Erlang, Haskell, F#, and a few others abstracted evented I/O and parallelism before Go even existed.

smegel
I should have added mainstream, non-functional, compiled language.
whateveracct
Moving the goalposts a bit there, aren't you?
smegel
No, I think they are common assumptions.
chrisseaton
They obviously aren't - this whole thread is about Python, which is not a compiled language. So clearly nobody moving from Python to Go already had a requirement for it to be compiled.
smegel
I was talking about the points "mainstream, non-functional" which excluded all the previous counter-examples. I only added "compiled" as it is a notable feature of Go that may be seen as beneficial for performance and other reasons.
InclinedPlane
Rust?

I would never consider Go mainstream myself though. Erlang is practically more mainstream.

smegel
Rust may well be, but it is not out yet?

Erlang is mainstream, but being functional probably not open to consideration for 99% of developers.

dalke
Rust, version 1.0.0, was released in May 2015, so over a year ago. 1.9 came out 10 days ago. Is that not out enough for you?
foota
@smegel, yes it is.
smegel
I have no idea. Is the language spec fully implemented by a robust and stable compiler/runtime?

If so, I'll use it today!

geofft
Yes, the compiler has fully implemented everything the language is intended to do since the 1.0 release a bit over a year ago. (Work focuses on the compiler and standard library itself; there isn't a separate spec doc that is updated to do things that aren't yet implemented or being implemented, so it's more like Python or GHC than R6RS or C++.) The compiler and the language are both stable; there's been no "runtime" in a strict sense since a few months after 1.0, but there's a standard library which is usually statically linked in, and both the language and the standard library are covered by a very strong stability guarantee (more-or-less, code that works today and isn't deliberately doing something stupid will work indefinitely, as there are no plans for a Rust 2.0). Plenty of people are using it in production. So I'd go with "Yes."
nzjrs
I worry about the bus factor of Rust. I was at the FOSDEM Rust meetup/devroom and it was 95% Mozilla folk.
majewsky
There might be a bias in that metric: Mozilla employees are probably more likely to afford the trip to FOSDEM.
kibwen
Less than half of the people listed at https://www.rust-lang.org/team.html are Mozilla employees, and the rust-lang/rust repo has over 1,400 contributors. I wouldn't worry too much about all of Mozilla getting hit by a bus. :)
lelandbatey
Rust offers almost everything that Go does, but with a (much) nicer compiler, tooling that makes me drool, memory safety, and all with zero-cost abstractions that keep Rust as fast or faster than the naive C++ equivalent.

So even though I work in a majority Go company, and I feel like Go is quite mainstream, I firmly believe that Rust is the future!

Btw, sorry if this Rust enthusiasm comes across as a bit over the top, I'm quite tired. I just really like it is all.

ngrilly
Rust is a great and powerful language, probably the best alternative to C++ nowadays, but I think you're overselling it.

Rust definitely does things Go doesn't, but Go also offers things Rust doesn't: builtin concurrency and parallelism with goroutines and channels, garbage collection (which is useful when your app can tolerate its moderate overhead), very fast compilation, and great tooling.

kibwen
Rust has builtin concurrency and parallelism in the standard library via threads, channels, and more, it just doesn't have green threads in the stdlib. As well, Rust also comes bundled with great tooling.
nemothekid
>it just doesn't have green threads in the stdlib

Green Threads is one of the most important features in golang - to pretend that "Rust offers almost everything that Go does", but ignore the number 1 feature is dishonest.

kibwen
I'm not the one who claimed that "Rust offers almost everything that Go does", but to say that concurrency and tooling aren't things that Rust excels at is equally dishonest. :P
rdtsc
If Erlang breaks, there is a 50% chance you won't be able to see cat pictures on your smartphone anymore. It might not be cool with the startup crowd or the Google-following one, but it is an industrially used language for many products (RabbitMQ, CouchDB, Riak, ejabberd, WhatsApp messaging, payment systems in Europe, and others).

It is cool that you like Go; it is a nice language. But before you claim "it is the only language that does X", you should learn about other languages as well.

smegel
I know about Erlang and what it can do, and I maintain Go is the only mainstream, non-functional language that does abstraction of event based/threaded programming well, or at all. I'm not even sure Rust does...the best I can find is it was planned to, but it didn't make it in.

And I don't consider any functional language as a viable choice for the vast majority of mainstream users.

rubiquity
Outside of HackerNews and Google I wouldn't consider Go to be very mainstream at all.
Pxtl
Even good old C# with async/await does async I/O pretty well.
ngrilly
C# with async/await is a different approach: it's a kind of cooperative multitasking where "yield points" are explicit. Go uses a kind of preemptive multitasking, in which goroutines don't have to explicitly "yield" control to the scheduler.
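Python's own asyncio follows the same cooperative model described here for C#: yield points are explicit `await` expressions. A minimal sketch:

```python
import asyncio

# Each coroutine runs until it hits an explicit `await`; nothing preempts
# it in between (unlike goroutines, which the Go runtime can preempt).
async def worker(name, results):
    results.append(f"{name}: start")
    await asyncio.sleep(0)  # explicit yield point: control returns to the loop
    results.append(f"{name}: resumed")

async def main():
    results = []
    await asyncio.gather(worker("a", results), worker("b", results))
    return results

order = asyncio.run(main())

# Both workers reach their first `await` before either one resumes.
assert order[:2] == ["a: start", "b: start"]
assert sorted(order[2:]) == ["a: resumed", "b: resumed"]
```

Between yield points, code cannot be interleaved with other tasks, which is the trade-off cooperative multitasking makes for simpler reasoning about shared state.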
fatbird
Hastings proposed avoiding breaking changes by having separate extension integration points for Python with the GIL and without. In other words, CPython would be compilable without the GIL, and extensions wanting true threaded parallelism would implement additional integration points to take advantage of that when run in a GIL-less Python. So numpy, for example, would continue to work with the GILled Python, but a data scientist would also have the GIL-less Python available: set up a virtualenv for it, install numpy, and see all those benefits, without ever causing breaking changes for extension authors.

Dunno if that'll work in practice, but it seems like the best possible plan going forward to get the change without causing strife.

jamesdutc
I had a personal conversation with Larry Hastings (the presenter) at PyCon. Here are a couple of notes from that chat, phrased neutrally. Some of these points may be reïterated in the linked video:

- We can view this work as a revisiting of Greg Stein's GIL-removal attempt in Python 1.4:

http://dabeaz.blogspot.com/2011/08/inside-look-at-gil-remova...

It seems wholly reasonable to revisit the approach in light of how the language and ecosystem have changed since 1999.

There are demands made of CPython core developers to remove or address the problem of the GIL, and these efforts demonstrate how much work is necessary to do that successfully.

- Comparing single-threaded performance in a GIL implementation against single-threaded performance in a GIL-less implementation is considered an unfair comparison. A GIL-less implementation will do extra bookkeeping that will necessarily result in slower single-threaded performance.

IanCal
> - Comparing single-threaded performance in a GIL implementation against single-threaded performance in a GIL-less implementation is considered an unfair comparison. A GIL-less implementation will do extra bookkeeping that will necessarily result in slower single-threaded performance.

Unless you can choose between having the GIL or not, I think it's perfectly reasonable to compare the performance. If you can choose, then I think it's still useful to know the kind of overhead you're adding.

Bromskloss
Is there anything inherent to the language that makes this global lock thing difficult to be without in Python, but not in other languages?
fatbird
Nothing inherent, no. Hastings discusses this in the talk: the short version is that having a GIL gives Python a fairly simple, straightforward implementation (also for extension authors); not having it requires a lot of work to achieve comparable performance. This is why so many interpreted languages use a GIL (like Ruby). No language requires it; it just makes the language much easier to develop and extend. Implementations like Jython don't have one because the hard work of not having one is already done in the VM.

It only took him about a week to remove the GIL (and the talk includes the steps); all this work is making a GIL-less python work as well as one with a GIL.

ubernostrum
Not really. Other languages have a GIL or something GIL-like.

The backstory of Python's GIL is a tradeoff made nearly a quarter-century ago. Python came from the Unix world, where forking processes was the common way of having multiple lines of execution, but in the 90s threading was starting to take off as an alternative (insert standard anecdote about Java adopting threading since it had to run on set-top TV boxes that couldn't handle multi-process concurrency).

So Python needed to gain support for threading, but also needed to not horribly break a lot of existing code -- particularly extensions written in C -- and not destroy performance. Which was an issue since Python's memory management and garbage collection internals were not thread-safe, and it would've been both a major slowdown and a major breaking change to just say "all right, we'll do this in the theoretically best possible way".

The trade-off solution was to introduce this global lock which needed to be acquired and released. The side effects of the lock are twofold:

1. Only a single thread can be executing at a time within a given Python interpreter process, and

2. The bookkeeping around the lock introduces overhead. This overhead is negligible in some cases but ruinous in others.

The tradeoff seemed reasonable at the time, though. For one thing, hardware capable of exposing the single-thread limit as a relevant limit wasn't exactly common in 1992. For another, the overhead of the GIL mainly bites you in CPU-bound workloads, and at the time people wanted threading to write I/O-bound applications like networking daemons.

Here in 2016, of course, everybody has a multi-core computer in their pocket and we see plenty of CPU-bound workloads that we'd like to parallelize. Which is why people complain about the GIL. But now it's even more of a breaking change to try to remove it, and still has performance implications.

dalke
Inherent? No. Neither Jython nor IronPython has a GIL.
the_mitsuhiko
> Is there anything inherent to the language that makes this global lock thing difficult to be without in Python, but not in other languages?

Both language and interpreter design. I was wanting to give a talk or at least write about some of the internals of the interpreter and language and how to do it better next time.

In case there is interest for this I might actually get around doing that for once.

(I played around with a different form of GIL-less execution a few years back, based on independent interpreters and message passing with thread-bound objects, but I ran into so many issues with the interpreter :()

eric_bullington
> In case there is interest for this I might actually get around doing that for once.

Yes, I'm very interested in your thoughts on this, particularly wrt interpreter design.

bakery2k
I would certainly be interested to learn about better language and interpreter design.

Do you think anyone will ever use the lessons learned about interpreters and write a "better next time" implementation of Python, or do you only see an improved runtime appearing alongside a significantly different language?

pizlonator
The biggest problem is the use of reference counting.

Consider this simple statement:

tmp = p.f; o.f = tmp;

In a garbage-collected language that supports concurrency and has no GIL, this is a wait-free operation with no locks. You load p.f. You store o.f. Maybe there is a GC barrier, but that fast-paths to a load-branch combo 99% of the time. It's all easy and fast.

In a reference-counted language that supports concurrency and has no GIL, you're in a world of hurt.

First you have to make sure that you atomically increment the reference count on 'tmp'. This is guaranteed to have to store to the memory that 'tmp' points to. Note that in the GC world, no such store happened. Already here you have a problem: you're storing to memory much more often. Stores are the things that parallel CPUs hate, because the hardware now has to do work to propagate the value you stored to the other CPUs.

But wait, there's more. You also have to atomically decrement the reference count of the object that 'o.f' used to point to. That's another store that will have to be propagated. Now your algorithm for "o.f = tmp" looks like this:

1. Atomically increment the reference count of 'tmp'.

2. Load the old value of o.f and atomically decrement the reference count. If it drops to zero, delete the object.

3. Store 'tmp' into 'o.f'.

But wait, you're not done yet. You also have to handle the decrement race on 'o.f': if two threads simultaneously store to 'o.f', you have a race between steps 2 and 3. You'd think that you could just use an atomic swap in (2), but you'd be wrong: you don't want two threads to both load the same old object (let's call that one 'oldfart') and dec it. Here's an example:

Thread #1: o.f = tmp;

Thread #2: o.f = thingy;

If the interleaving is Thread #1 executes step (2) then Thread #2 executes step (2) then Thread #1 executes step (3) then Thread #2 executes step (3), then you will dec 'oldfart' twice, inc 'tmp' once, and inc 'thingy' once. This means that you've over-released 'oldfart' and over-retained 'tmp'. Bad news!

So, you have to put a lock around steps (2) and (3). Basically every field in memory has to have a lock around it. That sucks!
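Those three steps, with the per-field lock around (2) and (3), can be rendered in a toy Python sketch (manual refcounts on toy objects; CPython's real machinery is C, so this only illustrates the shape of the problem):

```python
import threading

class Obj:
    """Toy object with a manual, lock-protected reference count."""
    def __init__(self, name):
        self.name = name
        self.refcnt = 1
        self._rc_lock = threading.Lock()

def incref(o):
    with o._rc_lock:          # stands in for an atomic increment
        o.refcnt += 1

def decref(o):
    with o._rc_lock:          # stands in for an atomic decrement
        o.refcnt -= 1
        # a real runtime would free the object when this hits zero

class Holder:
    """Owner of a single field 'f', guarded by its own per-field lock."""
    def __init__(self, initial):
        self.f = initial
        self._f_lock = threading.Lock()

def store_field(holder, tmp):
    incref(tmp)               # (1) retain the incoming value
    with holder._f_lock:      # the per-field lock around (2) and (3)
        oldfart = holder.f    # (2) load the old value exactly once...
        holder.f = tmp        # (3) ...then store the new one
    decref(oldfart)           # release the old value exactly once

a, b = Obj("a"), Obj("b")
h = Holder(a)                 # h owns a's initial reference
store_field(h, b)
print(a.refcnt, b.refcnt)     # a is released (0), b is held twice (2)
```

Without `_f_lock`, two racing `store_field` calls could both read the same `oldfart` and decrement it twice, exactly the over-release described above.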

It's kind of funny to me that people who are stuck with reference counting say things like "yeah, garbage collection can be made to be as fast as reference counting". That makes me lol. Garbage collection is so much faster than reference counting, particularly in concurrency scenarios, because:

In a GC: "o.f = p.f" is a load and a store, with maybe an extra load-branch if you have a barrier. That's it. There's no locks! There's no atomics! Notably, the only thing being stored to is "o.f", just as the user intended.

In reference counting: "o.f = p.f" means atomic ops on 'tmp', a lock around "o.f", and atomic ops on 'oldfart'. The lock will require one or two atomic ops. There will be atomic stores to the cache line of "o.f", the cache line of "tmp", and the cache line of "oldfart". What a mess!

TL;DR. Garbage collection is a slam dunk of overwhelming awesomeness.

jerf
It was ground into CPython at an early date, and every Python C extension ends up with it ground in too, because it's in the C API. Some things are very difficult to retrofit onto an existing system. [1]

If you create a new Python implementation from scratch it doesn't need to be there. It is not intrinsic to Python; it is intrinsic to the specific CPython implementation.

It is also, broadly speaking, easy to remove the GIL. However the bar the CPython developers have set is that they will not accept a GIL-removal patch that harms single-threaded performance. This is the bar that has proved difficult to hurdle.

It is also fair to point out that while it's a perennial topic in the Python community, it is not the only scripting language implementation with a GIL or moral equivalent in it.

[1]: For similar reasons, none of the core implementations of the 1990s-era dynamic scripting languages have a very good true multi-threading story. (Some of the alternates can do it, like Jython.) They don't all even have implementations, and last I knew, none of them have anything that you ought to use in production. I don't think this is a fundamental limitation of any of the languages in question, it's just that it's really hard to retrofit threading onto a non-threaded code base after ~ten very heavy years of development. There's a set of other things that are hard to retrofit in to an existing code base if they don't start there from day one, like "unit testability" or "proper string handling".

eloff
This is not true - it is intrinsic to Python, not just CPython. Part of it is the garbage collector: reference counting is expensive to do without the GIL, but that can be changed - IronPython, Jython, and some translations of PyPy all attest to that.

The other part is that C methods, which include all operations on Python collections like lists and dicts, are atomic, making those collections safe to share and modify across threads. Most Python code in the wild that utilizes threads relies on those guarantees. So you either use fine-grained locking, lock-free data structures, STM, etc. to protect ALL collection instances, or you've basically made a Python 3000 that breaks compatibility with all existing Python code. Every Python implementation that removes the GIL goes the first route of pessimistically making everything threadsafe - which was tried in Java way back when and was, and is, a miserable failure. You give up too much performance for it to be worthwhile.

So the problem in Python will never be fixed in my opinion - the problem is Python and there's no saving it. Without breaking backwards compatibility it will always be better to use processes instead of threads and keep the GIL.
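The guarantee eloff describes is easy to observe: in CPython, `list.append` runs as a single C-level operation under the GIL, so threaded code like this never loses an append despite using no lock at all (which is precisely the code a GIL-less Python must not break):

```python
import threading

results = []  # shared between threads, deliberately unlocked

def produce(start):
    for i in range(1000):
        # In CPython, list.append executes atomically under the GIL,
        # so concurrent appends never corrupt the list or drop items.
        results.append(start + i)

threads = [threading.Thread(target=produce, args=(n * 1000,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 4000: nothing lost, no lock used
```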

andreasvc
Reference counting and the C API are properties of the CPython implementation, not Python the language. Other Python implementations do NOT use reference counting and don't offer the C API, but do make guarantees about atomic operations. I think the problem with backward compatibility is with wanting to preserve the C API and the existing extension modules implemented in C, not necessarily with Python the language.
eloff
I should have been clearer that the other Pythons don't use reference counting. Preserving the C API can be done, see Ironclad[1] which allows using C extensions in IronPython.

But assuming threadsafe collections is still a problem - that will never match single-threaded GIL Python running in multiple processes unless some efficient way can be found to optimistically make the collections not threadsafe and only pay the synchronization costs when they're really shared between threads. I think that's a solvable problem, but not with the STM approach PyPy took. Still, performance will likely lag the multi-process solution, which doesn't have that cost.

I would rather see a Python that introduces a way to allocate PyObjects in shared memory and with no thread safety. Since it's a new system, that can be done without breaking compatibility. Then sharing between processes could be as easy as sharing between threads and just as efficient. The GIL could stay in that case and the C API could be left unchanged.

[1] https://github.com/IronLanguages/ironclad

bakery2k
> all operations on Python collections like lists and dicts, are atomic, making those collections safe to share and modify across threads. Most python code in the wild that utilizes threads makes use of those guarantees. So you either use fine-grained locking, or lock-free data structures, or STM, etc to protect ALL collection instances, or you've basically made a Python 3000 that breaks compatibility with all existing Python code.

Do IronPython and Jython, Python implementations which do not have a GIL, use fine-grained locking? How is multi-threaded performance on these implementations? Surely not as bad as this talk implies CPython with fine-grained locking would be?

eloff
IronPython uses a combination of lock-free techniques and fine-grained locking. Reference counting is of course replaced with the .NET runtime GC. Single-threaded and multi-threaded performance suffers since this approach means pessimistically synchronizing all accesses, including those that are single-threaded.
fatbird
You should watch the video. I won't try to repeat it all here, but Hastings identifies the performance problem areas and lays out some plausible directions for tackling it, and a plausible implementation/changeover path that doesn't create another big breaking change.
jerf
If you carefully read what I said, I think you'll find there's less disagreement between what you posted and what I said than your tone suggests you think. I did, for instance, mention C methods, but you seem to imply that that matters to all Python implementations. It only matters to CPython. Barring what last I knew was some fairly experimental stuff from PyPy, Python extensions are not generally compatible between the implementations. (Where you see libraries that appear to work on multiple implementations, AIUI you're actually seeing multiple implementations of the binding code.)

Personally my feeling is that if you care enough about performance to be threading in Python, you care enough about performance to not be using Python anyhow. It's a nice language, it's by far my favorite of the scripting languages of its family, but even with all the PyPy JIT and other magic tech words you can throw at it, it's still a slow language. When that doesn't matter, great; when it does, you eventually reach the point where you're stuck. (I've never been overwhelmed by the "implement it in C then" argument, for many reasons, and nowadays to an increasing degree if you're going to implement your "core" in another language for speed, there's a language you can choose that is both nice to use and already fast for your task, and that's only going to get more true as time goes on.)

eloff
The C methods do unfortunately matter to all implementations. They gave thread safety and almost all Python code in the wild relies on that property, so it's become an undocumented property of the Python language specification and an intrinsic flaw of the language itself.
andreasvc
It seems to me that the option of going with a tracing garbage collector is preferable. Removing the GIL will require a lot of changes anyway, and it may be better to go all the way so as to preserve performance. It would affect the C API and extension modules would have to be reworked for this, but on the other hand you avoid all the issues with reference counting. If extension modules rely on code generation, such as Cython, a move to a radically different C API might be less painful than expected.

From what I know, most current managed languages are using GC, not reference counting, so perhaps it's an inherently better approach.

Animats
Most of the Python implementations other than CPython use GC rather than reference counting. PyPy does. IronPython did.[1]

There is a version of PyPy without a GIL[2], but it runs much slower on ordinary code and is still under development. The developers are looking for financial support.[3] The approach is to identify large blocks of code as transactions, and run them in parallel. If they try to access the same data, one transaction fails and is backed out. It's like database rollback.

But you have to write your code like this:

    from transaction import TransactionQueue

    tr = TransactionQueue()
    for key, value in bigdict.items():
        tr.add(func, key, value)
    tr.run()

[1] http://doc.pypy.org/en/latest/cpython_differences.html [2] http://doc.pypy.org/en/latest/stm.html [3] http://pypy.org/tmdonate2.html
chadr
Semi related to this, is it possible to run multiple CPython interpreters in the same process, but with restrictions on shared memory? The idea being that each interpreter would still have its own GIL, each interpreter would have restrictions on shared memory (like sharing immutable structures only through message passing). Note, I'm not a big Python user so if this already exists, has been discussed, etc I am not aware.
jamesdutc
Sort of.

Note that Python has support for shared memory:

https://docs.python.org/2/library/multiprocessing.html#shari...

In fact, `numpy` has its own mechanisms to support shared memory between processes:

https://bitbucket.org/cleemesser/numpy-sharedmem

Neither of these approaches seems to be used very commonly in practice.

Python itself has some "sub-interpreter" support. There was a long conversation about this last year:

https://mail.python.org/pipermail/python-ideas/2015-June/034...

Finally, I have a working approach using `dlmopen` to host multiple interpreters within the same process:

https://gist.github.com/dutc/eba9b2f7980f400f6287

- the approach is so bizarre because it's a very naïve multiple-embedding. It was intended to prove that you could run a Python 2 and a Python 3 together in the same process, as part of a dare. This was thought impossible, since there are symbols with non-unique names that the dynamic linker would be unable to distinguish (which led me to the `RTLD_DEEPBIND` flag for `dlopen`), and there is global state in a Python interpreter that interacts in undesirable ways (which led me to `dlmopen` and linker namespaces).

- this approach is stronger than the traditional subinterpreter approach, since I can host multiple interpreters of distinct versions. i.e., I can host a Python 1.5 inside a Python 2.7 inside a Python 3.5.

- the approach is stronger in that I completely isolate C libraries. There's a good amount of functionality provided by C libraries that maintain global state. e.g., `locale.setlocale` is a wrapping of C stdlib locale and is globally scoped.

- this approach is weaker in that it requires a dynamic linker that supports linker namespaces, which effectively limits its use on Windows

- this approach is weaker in that it's not complete: there's insufficient interest in this approach for me to actually write the shims to allow communication between processes.

- this approach is weaker in that it has some weird restrictions such as being able to spawn only 15 sub-interpreters before running out of thread-local storage space

I suppose the premise is that the GIL-removal efforts involve pessimistic coördination. A sub-interpreter approach might have a lighter touch and allow the user to handle coördination between processes (perhaps even requiring/allowing them to handle locks themselves.)

nzjrs
I followed the sub-interpreter thread with great interest, but after the implementation went in I haven't seen anyone build the kind of multiprocess tooling that it was designed to enable. Have you heard of anything?
geofft
I'm not deeply familiar with the CPython API, but that should be completely straightforward as long as you don't trade native CPython structures between interpreters. If you start with something like Numpy's array interface, and implement an array in C that knows that it could be accessed by multiple Python interpreters and needs to become immutable before it's passed, it should work. But I wouldn't expect it to be easy to take a regular PyObject and move it between interpreters, so the cost of serialization and deserialization might be an issue.
IanCal
For the performance improvements, I think I'm missing something but why does coalesced reference counting have a high overhead?

Is it normally implemented along with buffered reference counting? It feels like those fit together very neatly, one thread managing the counts and receiving updates from the other threads, and each other thread tries to only send updates it needs to.

Is it simply a case of doing something basic a lot of times is faster than doing something smarter a few times because computers are just really fast at basic things? Or is there something more to this?

ensiferum
Linux's pthread locks are already implemented on top of futex and have a fast user space path for the non-contended case.
daveguy
I'm just glad that Guido hasn't gone on record saying there will never be an official python without a GIL.
fatbird
Guido's on record as having three "political" requirements for any GIL-less Python:

1. Same or better performance

2. No breaking existing extensions

3. Not overly complicating the CPython implementation

These are tough requirements, but obviously sensible, and Hastings's discussion of trying to meet them is interesting.

dbcurtis
This was a great talk, and was presented to a packed house (see below). If you don't want to watch the whole video, the issue really boils down to this:

1) Lots and lots of fine-grained locks. One on every collection. Oooof.

2) All that locking and unlocking absolutely thrashes the cache with Larry's current prototype implementation. I was surprised at how little time was spent doing the actual locking/unlocking itself. I was blown away by the performance impact of maintaining cache coherent locks (at least in Larry's current implementation). And for reference, I was a logic designer on mainframes in the 1980's where we paid attention to making locks perform well across independent caches, so I'm no newb around this issue, but it was still striking to me.

(packed house) Larry's joke at the beginning about "practicing for getting on your plane later tonight" is a reference to the packed seating. The Portland convention center staff were taking the Portland fire marshal's directives quite seriously. We spent several minutes making sure every seat was occupied, and then the staff evicted the standees :/

chrisseaton
I wonder if you could completely elide the CAS needed by userspace locks (the bit which thrashes the cache I presume) while there is only one thread running. Then when a new thread is created for the first time (if it ever is), pause the program and promote and acquire all the locks.

Like how the JVM assumes classes are final and removes virtual calls until it first sees a subclass.

gpderetta
Some JVM do exactly that. The optimisation is called biased locking and would likely work in python as well.
geofft
I think that was one of the suggestions from the after-talk conversation. The trouble is that you have to do it on thread creation, which is sort of a cheesy way to ensure single-threaded performance doesn't go down: it still leaves multi-threaded applications performing worse than under the GIL, which isn't super helpful. Ideally you'd do it on the first access to the object by a second thread, but there's no particular interface that makes it clear that's what's happening: the second thread just knows about the object somehow, and reads it at some point. And there are enough global variables in Python (e.g., every module) that simply promoting it when it's visible to multiple threads is too aggressive.
ori_b
There's a lot you can do. There's not much you can do in cpython as it currently stands without breaking a lot of code, especially when it comes to extensions.
chrisseaton
I work on a new Ruby interpreter, and our solution to this problem has been to interpret the C code of extensions. That way we can give it one API, but really implement another.
fatbird
Sorry if I keep responding, but there's a lot of interesting points in your question.

> Like how the JVM assumes classes are final and removes virtual calls until it first sees a subclass.

AFAIK, the python VM doesn't do any JITting at all. A runtime decision in the VM about handling locks is impossible at this point, I think.

dbcurtis
There is a version of Python that JITs -- I forget its name at the moment, though.
anateus
PyPy
smortaz
there are several. we (microsoft) are also trying to get an api into cpython that anyone can use to build JITs for python. our sample implementation is at:

https://github.com/Microsoft/Pyjion

pycon 2016 talk:

https://www.youtube.com/watch?v=1DAIzO3QXcA

ori_b
Sort of.

    if (threads)
         lock(...);
should still be quite a bit faster than fine-grained locks everywhere, since your atomics aren't broadcasting and shooting down cache lines everywhere.
chrisseaton
Well the user space locks he discusses already read a variable and branch between uncontended and contended cases. Specialising speculatively for single threaded would just be a third cannot-be-contended case. But yes it'd work better with a JIT - yet another reason to write one.
fatbird
Late in the talk he discusses strategies to avoid thrashing that amount to making locks as local as possible, ideally such that they behave the same way they would in the current unsynchronized environment. The big problem is that all the locks are global, so a change to a lock triggers a gigantic cascade of cache invalidation (and voided branch prediction, etc).

It's a great talk; the video is really worth watching. At the end, he says he welcomes anyone to join him in the sprints, but basically don't bother unless you really know cpython and multithreading programming because this is very advanced stuff.

chrisseaton
Right, but I'm suggesting why bother using atomic operations and cache invalidation if there is known to be just one thread? Which is the case we want the lowest overhead on. Why not only start doing that stuff when the second thread is started.
fatbird
This suggestion is actually kind of close to Hastings's implementation plan of having separate compilations for GIL and GIL-less Python, so that the default Python with a GIL continues to be the norm in most cases, but a GIL-less Python, with extensions taking advantage of true threaded parallelism, is available as a 'high-performance' Python for scientists and number crunchers and such.
None
None
fatbird
Ok, I think I see the point you're making, but what I took from Hastings's talk is that devolving locking to as local a scope as possible does just that: wherever possible, locking happens in an analytically single-threaded context, which is trivial and basically cost-free.

The cache invalidation isn't done by CPython; it's done at the chip level when global atomic incr and decr instructions are triggered. Hastings was pretty clear that the performance hit is very closely tied to the way modern processors depend heavily on caching, pipelining, branch prediction, etc.

I think that the kind of implementation you're suggesting, where the VM adopts a locking strategy based on whether or not it's actually multi-threaded, might be possible, but sounds like it conflicts with Guido's edict that no GIL-less solution will be accepted that overly complicates that cpython codebase.

I'm probably making a hash of the talk, but I saw it in person, and it was surprisingly clear to someone like me who isn't a serious C programmer, but remembers their class in operating systems and kernel design.

chrisseaton
The complexity requirement is the tricky one then, yes. I think I'd disagree with Guido there - a little complexity in the VM managed by a few experts is probably a good cost to accept for millions of users.
sametmax
It's not Guido's point. He means that CPython, being the reference implementation, must remain easy to understand.
Animats
Nothing is truly thread-local in Python. You can muck with anything from another thread, finding it with "dir" and altering it with "setattr" if necessary. This implies worst-case locking.

To get past all that locking, you need to be able to determine which objects are never visible to more than one thread. They don't need locks. PyPy might be able to do this; it does some global analysis.
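Animats's point fits in a few lines: nothing stops a worker thread from enumerating and rewriting another thread's objects, so the runtime must assume worst-case sharing (names here are illustrative):

```python
import threading

class Config:
    def __init__(self):
        self.mode = "set by main thread"

cfg = Config()

def meddle():
    # Any thread that can reach the object can enumerate its attributes
    # with dir() and rebind them with setattr() - no opt-in required.
    for name in dir(cfg):
        if not name.startswith("_"):
            setattr(cfg, name, "clobbered by another thread")

t = threading.Thread(target=meddle)
t.start()
t.join()
print(cfg.mode)  # "clobbered by another thread"
```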

mtanski
> 2) All that locking and unlocking absolutely thrashes the cache with Larry's current prototype implementation. I was surprised at how little time was spent doing the actual locking/unlocking itself. I was blown away by the performance impact of maintaining cache coherent locks (at least in Larry's current implementation). And for reference, I was a logic designer on mainframes in the 1980's where we paid attention to making locks perform well across independent caches, so I'm no newb around this issue, but it was still striking to me.

Sadly, I'm not optimistic about the approach. Fine-grained locking often has a cost people don't think about: reference counting converts every read into two writes (to update the ref counts). Then we get into second-order effects where unrelated data structures share a cache line with the lock and reference counts. You're just fighting the machine architecture.

Even with additional optimizations, I just can't see getting acceptable results with the approach as taken. With the constraints put in place by Guido around GIL elimination, it looks like Python on multicore is going to stay what we have today.

sametmax
Multi-interpreter is another possibility that is currently being explored. You still need to load one interpreter per core, but there is no more serialization needed to share objects.
loqi
How so? Sharing arbitrary objects via memory requires some form of coordinated memory management. So you still need a process-wide GC, meaning you're probably stuck with that 30% hit from thread-safe refcounting.

And all the other locking issues are still in play. How would you share a list with another interpreter such that they don't both have access to the same list items? Any solution that looks like a deep copy isn't going to be much better than just serializing across a process boundary.

sametmax
Right now multiprocessing uses very expensive serialization to send data from one worker to another. Shared memory is much cheaper.

Even if you didn't have the GIL, you would still need locks with threads to manipulate any mutable data structure. The usual concurrency safety advice always applies.

Jun 02, 2016 · 11 points, 2 comments · submitted by cjdrake
harlowja
Nice handling of the crappy question at the end...
tedmiston
It's helpful to know the abstract for Larry's talk before watching:

    DESCRIPTION

    CPython's GIL means your Python code can only run on one CPU core at a
    time. Can we remove it? Yes, we can... in fact we already have! But is it
    worth the cost?

    ABSTRACT

    CPython's "Global Interpreter Lock", or "GIL", was added in 1992. It was
    an excellent design decision. But 24 years is a long time--today it
    prevents Python from capitalizing on multiple CPUs. Many people want us to
    remove the GIL.

    It turns out, removing the GIL isn't actually that hard. In fact, I
    already removed it, in my experimental "gilectomy" branch. But the GIL is
    one reason CPython is so fast! The "gilectomy" makes CPython shockingly
    slow.

    This talk will discuss the history of the GIL, how the GIL helps make
    CPython fast, how the "gilectomy" removed the GIL, and some ways we might
    be able to make the "gilectomy" version fast enough to be useful.
from https://us.pycon.org/2016/schedule/presentation/2101/
HN Theater is an independent project and is not operated by Y Combinator or any of the video hosting platforms linked to on this site.