Concurrency in Python


Concurrency

In Python, the things that are occurring simultaneously are called by different names (thread, task, process), but at a high level, they all refer to a sequence of instructions that run in order.

You might wonder why Python uses different words for the same concept. It turns out that threads, tasks, and processes are only the same if you view them from a high level. Once you start digging into the details, they all represent slightly different things.

Now let’s talk about the simultaneous part of that definition. You have to be a little careful because, when you get down to the details, only multiprocessing actually runs these trains of thought at literally the same time.

Threading and asyncio both run on a single processor and therefore only run one thread or task at a time. They just cleverly find ways to take turns to speed up the overall process. Even though they don’t run different trains of thought simultaneously, we still call this concurrency.

The way the threads or tasks take turns is the big difference between threading and asyncio. In threading, the operating system actually knows about each thread and can interrupt it at any time to start running a different thread. This is called pre-emptive multitasking since the operating system can pre-empt your thread to make the switch.

Pre-emptive multitasking is handy in that the code in the thread doesn’t need to do anything to make the switch. It can also be difficult because of that “at any time” phrase. This switch can happen in the middle of a single Python statement, even a trivial one like `x = x + 1`.
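
To see why that matters, here is a minimal sketch (the names `counter` and `increment` are illustrative, not from the original) showing how a `threading.Lock` guards the read-modify-write that `x = x + 1` performs:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # Without the lock, "counter = counter + 1" is a read-modify-write
        # sequence that the OS could interrupt between the read and the write.
        with lock:
            counter = counter + 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000 with the lock; without it, increments can be lost
```

Removing the `with lock:` line makes the final count unpredictable, because two threads can both read the old value before either writes back.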

Asyncio, on the other hand, uses cooperative multitasking. The tasks must cooperate by announcing when they are ready to be switched out. That means that the code in the task has to change slightly to make this happen.

The benefit of doing this extra work up front is that you always know where your task will be swapped out. It will not be swapped out in the middle of a Python statement unless that statement is marked. This can simplify parts of your design.
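
A minimal sketch of that idea: in the code below, the only places a task can be swapped out are the `await` expressions, so the completion order is fully determined by the sleep durations (the task names here are illustrative):

```python
import asyncio

async def worker(name, delay, results):
    # "await" marks the only point where this task can be swapped out.
    await asyncio.sleep(delay)
    results.append(name)

async def main():
    results = []
    # Both tasks run concurrently on one thread; the event loop switches
    # between them only at the awaits.
    await asyncio.gather(
        worker("slow", 0.02, results),
        worker("fast", 0.01, results),
    )
    return results

order = asyncio.run(main())
print(order)  # ['fast', 'slow'] — the shorter sleep finishes first
```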

With multiprocessing, Python creates new processes. A process here can be thought of as almost a completely different program, though technically they’re usually defined as a collection of resources where the resources include memory, file handles and things like that. One way to think about it is that each process runs in its own Python interpreter.

Because they are different processes, each of your trains of thought in a multiprocessing program can run on a different core. Running on a different core means that they actually can run at the same time, which is fabulous. There are some complications that arise from doing this, but Python does a pretty good job of smoothing them over most of the time.

❓ When Is Concurrency Useful?

Concurrency can make a big difference for two types of problems. These are generally called CPU-bound and I/O-bound.

I/O-bound problems cause your program to slow down because it frequently must wait for input/output (I/O) from some external resource. They arise frequently when your program is working with things that are much slower than your CPU.

In the diagram above, the blue boxes show time when your program is doing work, and the red boxes are time spent waiting for an I/O operation to complete. This diagram is not to scale because requests on the internet can take several orders of magnitude longer than CPU instructions, so your program can end up spending most of its time waiting.

On the flip side, there are classes of programs that do significant computation without talking to the network or accessing a file. These are the CPU-bound programs, because the resource limiting the speed of your program is the CPU, not the network or the file system.

Here’s a corresponding diagram for a CPU-bound program:

As you work through the examples in the following section, you’ll see that different forms of concurrency work better or worse with CPU-bound and I/O-bound programs. Adding concurrency to your program adds extra code and complications, so you’ll need to decide if the potential speed up is worth the extra effort.

💠 Threads

It’s tempting to think of threading as two (or more) different processors running your program, each one doing an independent task at the same time. That’s almost right. The threads may be running on different processors, but only one of them will be running at a time.

Getting multiple tasks running simultaneously requires a non-standard implementation of Python, writing some of your code in a different language, or using multiprocessing, which comes with some extra overhead.

Because of the way the CPython implementation of Python works, threading may not speed up all tasks. This is due to interactions with the GIL (Global Interpreter Lock), which essentially limits Python to running one thread at a time.

❓ Which is better to use: executor.shutdown(wait=True) with a queue, or as_completed from concurrent.futures?

Both executor.shutdown(wait=True) and as_completed from concurrent.futures can be used to wait until all threads finish; the choice between them depends on the specific requirements and structure of your program.

  1. executor.shutdown(wait=True) and a queue:

    • This approach is simpler and more direct. executor.shutdown(wait=True) ensures that the executor accepts no more tasks and waits for all submitted tasks to complete.
    • You can use a Queue to safely collect results from the threads.
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        for record in data:
            executor.submit(invoke_lambda, record)

        # Wait until all threads finish. Exiting the "with" block calls
        # shutdown(wait=True) implicitly, so the explicit call is only
        # needed when you are not using the context manager.
        executor.shutdown(wait=True)
  2. as_completed:

    • as_completed returns an iterator that yields futures as they complete, letting you process each result as soon as it is available.
    • You can achieve a similar result to the queue approach by iterating over the futures returned by as_completed.
    futures = [executor.submit(invoke_lambda, record) for record in data]

    for future in concurrent.futures.as_completed(futures):
        result = future.result()
        # Process the result here
    • 🗒️ Note: You'll need to collect the results in a thread-safe manner if you are modifying shared data structures.

In summary, both approaches can be effective, and the choice depends on the specific requirements and structure of your application. If simplicity and directness are important, the first approach with executor.shutdown(wait=True) and a Queue may be more straightforward. If you need more fine-grained control over completed results, as_completed might be a good fit.
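
A self-contained, runnable version of the as_completed pattern, using a stand-in `invoke` function in place of the document's undefined `invoke_lambda`:

```python
import concurrent.futures

def invoke(record):
    # Stand-in for real work such as a network call (hypothetical).
    return record * 2

data = [1, 2, 3, 4]

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(invoke, record) for record in data]
    # as_completed yields futures in finish order, so sort for a stable view.
    results = sorted(future.result()
                     for future in concurrent.futures.as_completed(futures))

print(results)  # [2, 4, 6, 8]
```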

Daemon Threads

In computer science, a daemon is a process that runs in the background.

Python threading gives daemon a more specific meaning: a daemon thread shuts down immediately when the program exits. One way to think about these definitions is to consider a daemon thread as a thread that runs in the background without you having to worry about shutting it down.

If a program is running threads that are not daemons, it will wait for those threads to complete before it terminates. Threads that are daemons, however, are simply killed wherever they are when the program exits.
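
A minimal sketch: passing `daemon=True` to the `Thread` constructor marks the thread as one the program will not wait for (the `background_task` name is illustrative):

```python
import threading
import time

def background_task():
    # This loop never finishes on its own; because the thread is a
    # daemon, it is killed when the main thread exits.
    while True:
        time.sleep(0.1)

t = threading.Thread(target=background_task, daemon=True)
t.start()
print(t.daemon)  # True — the program can exit without joining this thread
```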

ℹ️ Notes

The other interesting change in our example is that each thread needs to create its own requests.Session() object. When you’re looking at the documentation for requests, it’s not necessarily easy to tell, but reading this issue, it seems fairly clear that you need a separate Session for each thread.
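
One common way to give each thread its own object is `threading.local`. The sketch below uses a plain dict as a stand-in for something like `requests.Session()` (the `get_session` helper is a hypothetical name for illustration):

```python
import threading

# Thread-local storage: each thread sees its own "session" attribute.
local = threading.local()

def get_session():
    # Lazily create one session-like object per thread; the dict is a
    # stand-in for an object such as requests.Session().
    if not hasattr(local, "session"):
        local.session = {"owner": threading.current_thread().name}
    return local.session

sessions = []

def worker():
    sessions.append(get_session())  # list.append is thread-safe in CPython

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len({id(s) for s in sessions}))  # 3 — each thread got its own object
```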

This is one of the interesting and difficult issues with threading. Because the operating system is in control of when your task gets interrupted and another task starts, any data that is shared between the threads needs to be protected, or thread-safe. Unfortunately requests.Session() is not thread-safe.

There are several strategies for making data accesses thread-safe depending on what the data is and how you’re using it. One of them is to use thread-safe data structures like Queue from Python’s queue module.
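
For example, a producer thread can hand work to a consumer thread through a `Queue` without any explicit locking, because the queue does its own locking internally (the producer/consumer split here is illustrative):

```python
import queue
import threading

q = queue.Queue()

def producer():
    for i in range(5):
        q.put(i)  # Queue handles its own locking internally

def consumer(out):
    for _ in range(5):
        out.append(q.get())  # blocks until an item is available
        q.task_done()

results = []
p = threading.Thread(target=producer)
c = threading.Thread(target=consumer, args=(results,))
p.start()
c.start()
p.join()
c.join()

print(sorted(results))  # [0, 1, 2, 3, 4]
```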

Well, as you can see from the example, it takes a little more code to make this happen, and you really have to give some thought to what data is shared between threads.


The correct number of threads is not a constant from one task to another. Some experimentation is required.

📰 Articles

⚙️ Tools

💠 asyncio

ℹ️ Notes

An important point about asyncio is that tasks never give up control without intentionally doing so. They never get interrupted in the middle of an operation. This allows you to share resources a bit more easily with asyncio than with threading: you don’t have to worry about making your code thread-safe.
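
A minimal sketch of that guarantee: two coroutines increment a shared counter with no lock, and no updates are lost, because control can only pass at an `await` (the `bump` name is illustrative):

```python
import asyncio

counter = 0

async def bump(n):
    global counter
    for _ in range(n):
        # No lock needed: this read-modify-write cannot be interrupted,
        # because the event loop only switches tasks at an "await".
        counter = counter + 1
    await asyncio.sleep(0)  # explicitly yield control to the other task

async def main():
    await asyncio.gather(bump(1000), bump(1000))

asyncio.run(main())
print(counter)  # 2000 — no lost updates, no lock required
```

Running the equivalent loop in two threads, by contrast, would require the lock shown in the threading section above.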

📰 Articles

⚙️ Tools

💠 multiprocessing

⚙️ Tools

  • multiprocessing — Process-based parallelism
  • Parallel Python — a Python module that provides a mechanism for parallel execution of Python code on SMP systems (machines with multiple processors or cores) and clusters (computers connected via a network).

📰 Articles


🔹 Concurrent API requests

🗞️ Other Related Topics and Articles

◾ Wikipedia