Tuesday, February 23, 2016

Concurrency in Python

More speed by having two rails? Not always true.
Last year, PEP 0492 was accepted, introducing coroutines and async/await to Python. Around that time, I started subscribing to some Python mailing lists and have participated in discussions there ever since. I wondered how ordinary Python developers can write code that executes in parallel, or at least concurrently. From those discussions, specifically regarding asyncio (coroutines) and concurrency in general, a survey was compiled, which I want to record here.

Improving Performance by Running Independent Tasks Concurrently - A Survey

                    Processes                 Threads                       Coroutines
purpose             cpu-bound tasks           cpu- & i/o-bound tasks        i/o-bound tasks
customizable        no                        no                            yes
controllable        no                        no                            yes
managed by          os scheduler              os scheduler + interpreter    event loop
parallelism         yes                       no                            no
switching           at any time               after any bytecode            at user-defined points
shared state        no                        yes                           yes
startup time        biggest/medium*           medium                        smallest
CPU overhead**      biggest                   medium                        smallest
memory overhead     biggest                   medium                        smallest
pool class          multiprocessing.Pool      multiprocessing.dummy.Pool    asyncio.BaseEventLoop
solo class          multiprocessing.Process   threading.Thread              asyncio.coroutine

* biggest - on Windows or when using 'spawn' ('fork' + 'exec'); medium - when using 'fork' alone
** due to context switching
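
Since Python 3.4, the start method can be chosen explicitly, which makes the difference in the first footnote tangible. A minimal sketch, with `work()` as a placeholder task:

```python
import multiprocessing as mp

def work(x):
    return x * x

if __name__ == '__main__':
    # 'spawn' is the only start method on Windows; on POSIX, 'fork' avoids
    # re-importing the module in the child and therefore starts up cheaper
    ctx = mp.get_context('spawn')
    with ctx.Pool(2) as pool:
        print(pool.map(work, range(4)))   # -> [0, 1, 4, 9]
```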

I started this little survey out of curiosity and professional needs to speed up our Python production systems. What I can tell from the experience gained in that field is that you basically need to ask yourself two questions:
  1. Does this code run faster with concurrency?
  2. Is the code cpu-bound or i/o-bound?
Writing and maintaining concurrent code makes many brains hurt. So, if concurrency doesn't yield a significant improvement, leave your code alone. If you still want to go concurrent, you then need to settle for either processes (for cpu-bound tasks) or threads/coroutines (for i/o-bound tasks). As usual, your mileage may vary (also considering the table above); the sketch below contrasts the two pool approaches.
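
To make those two questions concrete, here is a minimal sketch of both answers using the pool classes from the table. `count_primes()` and `fetch()` are hypothetical stand-ins for a cpu-bound and an i/o-bound workload, and the URL is just a placeholder:

```python
import multiprocessing
import multiprocessing.dummy
import urllib.request

def count_primes(n):
    # cpu-bound: pure computation, profits from real parallelism
    return sum(all(i % d for d in range(2, i)) for i in range(2, n))

def fetch(url):
    # i/o-bound: spends most of its time waiting on the network
    return len(urllib.request.urlopen(url).read())

if __name__ == '__main__':
    # cpu-bound work -> processes
    with multiprocessing.Pool(4) as processes:
        print(processes.map(count_primes, [5000] * 4))

    # i/o-bound work -> threads (same Pool API, backed by threads)
    with multiprocessing.dummy.Pool(4) as threads:
        print(threads.map(fetch, ['http://example.com'] * 4))
```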

In the course of the thread, Steve Dower responded very insightfully (here and here), which was my main driver to extend the survey. His replies explain certain values in the table quite intuitively, so I am going to quote him here for your convenience.


Steve Dower's "Let's Bake a Cake"
Let's say you are making a cake. There are three high-level steps involved:
  1. Gather all the ingredients
  2. Mix all the ingredients
  3. Bake it in the oven
You are personally required to do steps 1 and 2 ("hands-on"). These steps take all of your time and attention, and you can't do anything else simultaneously.

For step 3, you hand off the work to the oven. While the oven is baking, you are basically free to do other things.

In this analogy, "you" are the main thread and the oven is another thread. (Thread and process are interchangeable here in the general sense - the GIL in Python is a practicality that makes processes preferable, but that doesn't affect the concepts.) Steps 1 and 2 are CPU-bound (as far as "you" the main thread are concerned), and step 3 is IO-bound from "your" (the main thread's) point of view.

Step 3 requires you to wait until it is complete:
  • You can do a synchronous wait, by sitting and staring at the oven until it's done.
  • You can poll, by occasionally interrupting yourself to walk over to the oven and see if it's done yet.
  • You can use a signal/interrupt: the oven interrupts you the moment it is ready, regardless of whether you are ready to handle the interruption (but note: you know that the oven is done without having to walk over and check it).
  • Or you can use asyncio, where you occasionally interrupt yourself and, when you do, the oven will make some noise if it has finished. (And if you never interrupt yourself, the oven never makes a sound.)
This last option is most efficient for you, because you aren't interrupted at awkward times (i.e. greatly reduced need for locking on shared state) but you also don't have to walk all the way over to the oven to check whether it is done. You pause, listen, and get straight back to work if the oven is still going. That's the core feature of asyncio - not the networking or subprocess support - the ability to be notified efficiently that a task is complete without being interrupted by that notification.
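
A minimal sketch of that last option, with `bake()` and `mix()` as hypothetical stand-ins for the oven and the hands-on work:

```python
import asyncio

async def bake():
    # the oven: pure waiting from our point of view
    await asyncio.sleep(2)
    return 'cake'

async def mix():
    # "hands-on" work with explicit pause points
    for step in range(5):
        print('mixing, step', step)
        await asyncio.sleep(0)   # pause and listen; then get straight back to work

async def main():
    oven = asyncio.ensure_future(bake())   # hand the work off to the oven
    await mix()                            # keep working in the meantime
    print('done:', await oven)             # pick up the result once it is ready

asyncio.get_event_loop().run_until_complete(main())
```

Each `await asyncio.sleep(0)` is one of those pauses: control briefly returns to the event loop, which delivers any pending notifications before mixing resumes.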

Now let's expand this to making 3 cakes in parallel to see how "parallelism" works. Since there's so much going on, we'll create a TODO list:
  1. Make cake #1
  2. Make cake #2
  3. Make cake #3
(This means we've submitted three tasks to the current event loop. Likely these are three external requests from clients, such as HTTP requests. It is possible, though not common in my experience, for production software to explicitly start with multiple tasks like this. More common is to have one task and a UI event loop that injects UI events as necessary.)

Task 1 is the obvious place to start, so we take that off the TODO list and start working on it. The steps to make cake #1 are:
  • Gather ingredients for cake #1
  • Mix ingredients for cake #1
  • Bake cake #1
Gathering ingredients is a synchronous operation (`def gather_ingredients()`) so we do that until we've gathered everything.

Mixing ingredients is a long, interruptible operation (`async def mix_ingredients()`, with occasional explicit `await yield()` or whatever syntax was chosen for this), so we start mixing and then pause. When we pause, we put our current task on the TODO list:
  1. Make cake #2
  2. Make cake #3
  3. Continue mixing cake #1
We see that our next task is to make cake #2, so we repeat the steps above and eventually pause while we're mixing. Now the TODO list looks like:
  1. Make cake #3
  2. Continue mixing cake #1
  3. Continue mixing cake #2
And this continues. (Note that selecting which task to continue with is a detail of the event loop you're using. Check the spec to see whether some tasks have a higher priority or what order tasks are continued in. And bear in mind that so far, we've only used explicit yields - "I'm ready to do something else now if something needs doing".)

Eventually we will finish mixing one of the cakes, let's say it's cake #1. We will put it in the oven (`await put_in_oven()`) and then check the TODO list for what we should do next. There's nothing for us to do with cake #1, so our TODO list looks like:
  1. Continue mixing cake #2
  2. Continue mixing cake #3
Eventually, the oven will finish baking cake #1 and will add its own item to the TODO list:
  1. Continue mixing cake #2
  2. Continue mixing cake #3
  3. Cake #1 is ready
When we take a break from mixing cake #2, we will continue mixing cake #3 (again, depending on your event loop's policy with regards to prioritisation). When we take a break from mixing cake #3, "Cake #1 is ready" will be the top of our TODO list and so we will continue with the statement following where we awaited it (it probably looked like `await put_in_oven(); remove_from_oven()` or maybe `baked_cake = await put_in_oven(mixed_ingredients)`).

Eventually our TODO list will be empty, and so we will sit there waiting for something to appear on it (such as another incoming request, or an oven adding a "remove cake" item).
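
To tie the walkthrough together, here is a hedged sketch that reuses the hypothetical names from above (`gather_ingredients()`, `mix_ingredients()`, `put_in_oven()`) and simulates the oven with `asyncio.sleep()`:

```python
import asyncio

def gather_ingredients(n):
    # synchronous: occupies "us" completely until it returns
    print('gathered ingredients for cake #%d' % n)

async def mix_ingredients(n):
    # long but interruptible: every await puts "continue mixing" on the TODO list
    for step in range(3):
        await asyncio.sleep(0)
        print('mixing cake #%d (step %d)' % (n, step))

async def put_in_oven(n):
    # i/o-bound from our point of view: the oven posts back when it is done
    await asyncio.sleep(1)
    return 'baked cake #%d' % n

async def make_cake(n):
    gather_ingredients(n)
    await mix_ingredients(n)
    baked_cake = await put_in_oven(n)
    print(baked_cake)

# put three "Make cake" tasks on the TODO list and run until all are done
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*(make_cake(n) for n in (1, 2, 3))))
```

Running it shows the interleaving from the TODO lists above: the mixing steps of the three cakes alternate, and the "baked" lines appear only once the ovens report back.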

Processes and threads only really enter into asyncio as a "thing that can post messages back to my TODO list/event loop", while asyncio provides an efficient mechanism for interleaving (not parallelising) multiple tasks throughout an entire application (or a very significant self-contained piece of it). The parallelism only comes when all the main thread has to do for a particular task is wait, because another thread/process/service/device/etc. is doing the actual work.
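
A minimal sketch of that hand-off, with `blocking_bake()` as a hypothetical blocking function; `loop.run_in_executor()` wraps the executor's future so the event loop is simply notified when the other thread (or process) is done:

```python
import asyncio
import concurrent.futures
import time

def blocking_bake():
    # runs in another thread (or process) - the main thread only waits
    time.sleep(1)
    return 'cake'

async def main(loop):
    # a ProcessPoolExecutor plugs in the same way for truly cpu-bound work
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        cake = await loop.run_in_executor(pool, blocking_bake)
    print(cake)

loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop))
```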

"But I still have a question: why can't we use threads for the cakes? (1 cake = 1 thread)."

Because that is the wrong equality - it's really 1 baker = 1 thread.

Bakers aren't free, you have to pay for each one (memory, stack space), it will take time for each one to learn how your bakery works (startup time), and you will waste some of your own time coordinating them (interthread communication).

You also only have one set of baking equipment (the GIL), buying another bakery is expensive (another process) and fitting more equipment into the current one is very complicated (subinterpreters).

So you either pay a high price for 2 bakers = 2 cakes, or you accept 2 bakers = 1.5 cakes (in the same amount of time). It turns out that often 1 baker can do 1.5 cakes in the same time as well, and it's much easier to reason about and implement correctly.

I further want to thank everybody who participated in discussing the matter and thus improved the survey enormously with a lot of patient explanations and insightful details.

There will be another post covering a module I wrote back then and still actively maintain. It was, and still is, a way of digesting the concurrency matter in an attempt to improve usability and reduce boilerplate in Python.

Best,
Sven
