Python Parallelism
The GIL and What It Means
The Global Interpreter Lock (GIL) ensures that only one thread executes Python bytecode at a time. This is a design choice in CPython: it simplifies memory management and makes C extensions easier to write safely, but it means threading does not give you CPU parallelism for pure-Python code.
Threads in Python are only truly parallel for:
- I/O operations (network calls, disk reads, database queries)
- C-extension code that releases the GIL (numpy, scipy, pandas operations on arrays)
CPU-bound pure-Python work running on multiple threads is typically no faster than a single thread, and often slower, because of GIL contention and context-switching overhead. If your workload is CPU-bound, you need processes, not threads.
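A minimal sketch of the I/O case, using time.sleep as a stand-in for a blocking network or disk call (sleep releases the GIL, as real blocking I/O does). The function names and the 0.2-second delay are illustrative assumptions, not part of any library.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(delay):
    # time.sleep releases the GIL, much like a blocking socket or disk read
    time.sleep(delay)
    return delay

def timed_threads(n_tasks, delay):
    """Run n_tasks simulated I/O calls on n_tasks threads; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        list(pool.map(fake_io, [delay] * n_tasks))
    return time.perf_counter() - start

if __name__ == "__main__":
    elapsed = timed_threads(4, 0.2)
    print(f"4 x 0.2s waits took {elapsed:.2f}s on threads")
```

The four waits overlap, so the total should be close to one delay rather than four. Replacing fake_io with a pure-Python CPU loop would remove that benefit.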
CPython 3.13 introduced an experimental free-threaded build (compiled with the --disable-gil configure option). It is not production-ready as of 3.13. Plan for it, but do not depend on it yet.
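If you need to know which kind of interpreter you are running on, a small check along these lines may help. It is a sketch: gil_status is a hypothetical helper name, and it assumes sys._is_gil_enabled() (available from 3.13) and the Py_GIL_DISABLED build variable behave as documented for the free-threaded build.

```python
import sys
import sysconfig

def gil_status():
    """Return (free_threaded_build, gil_enabled) for the running interpreter."""
    # Py_GIL_DISABLED is set to 1 only on free-threaded builds
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on 3.13+; older versions always have the GIL
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return free_threaded_build, gil_enabled

if __name__ == "__main__":
    built, enabled = gil_status()
    print(f"free-threaded build: {built}, GIL enabled: {enabled}")
```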
Decision Tree
Pick the right tool before writing code. The wrong choice wastes days:
I/O-bound with many small tasks (HTTP requests, file reads, database queries): Use asyncio. One event loop handles thousands of concurrent I/O operations with no thread overhead.
CPU-bound with simple embarrassingly parallel tasks (process each file, compute each row): Use concurrent.futures.ProcessPoolExecutor. This is the standard library path. It works well when each task is independent and takes more than ~10ms.
CPU-bound with large data arrays (matrix operations, image processing, signal processing): Use multiprocessing with shared memory, or offload to numpy/scipy (which release the GIL). For cluster scale, use Ray.
CPU-bound with complex task graphs or cluster scale (ML training, pipeline DAGs, distributed processing): Use Ray. It handles task scheduling, fault tolerance, and cluster management.
Numerical loops over arrays where numpy is too coarse: Use Numba for JIT-compiled parallel loops.
When nothing in Python is fast enough: Write the hot path in C++ with pybind11, or use GPU compute (CuPy, CUDA).
concurrent.futures.ProcessPoolExecutor
The standard library path for CPU-bound parallelism. It spawns worker processes, each with its own GIL, so CPU work actually runs in parallel.

    from concurrent.futures import ProcessPoolExecutor, as_completed

    def expensive(x):
        return sum(i * i for i in range(x))

    # The guard is required wherever workers are spawned (Windows, macOS)
    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=4) as executor:
            # Map each future back to its input so results can be labeled
            futures = {executor.submit(expensive, n): n for n in range(10_000, 10_010)}
            for future in as_completed(futures):
                n = futures[future]
                result = future.result()  # raises if the worker raised
                print(f"{n} -> {result}")

executor.map vs executor.submit: Use map when you have a simple function applied to an iterable and want results in input order. Use submit when you want results in completion order (via as_completed) or per-task exception handling.
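The ordering difference can be sketched as follows. The helper names (square, with_map, with_submit) are hypothetical. Both helpers take any concurrent.futures executor, so the same code runs against a ProcessPoolExecutor or a ThreadPoolExecutor.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def square(x):
    return x * x

def with_map(executor, values):
    # map yields results in input order, regardless of which task finishes first
    return list(executor.map(square, values))

def with_submit(executor, values):
    # submit + as_completed yields futures as they finish, so we key by input
    futures = {executor.submit(square, v): v for v in values}
    return {futures[f]: f.result() for f in as_completed(futures)}

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(with_map(pool, [3, 1, 2]))     # [9, 1, 4]
        print(with_submit(pool, [3, 1, 2]))  # same values, keyed by input
```

With submit, the dictionary's insertion order reflects completion order and can differ between runs. That is why the sketch keys each result by its input.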
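max_workers tuning: Start with os.cpu_count(), which reports logical CPUs (hyperthreads included), not physical cores. If you want the physical core count, a library such as psutil can report it (psutil.cpu_count(logical=False)). Memory-heavy tasks may need fewer workers. Profile before tuning further.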
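One possible starting point for choosing a worker count is sketched below. usable_cpus and pick_workers are hypothetical helpers, and halving the count for memory-heavy tasks is an arbitrary assumption to be replaced by profiling.

```python
import os

def usable_cpus():
    """Number of logical CPUs this process may run on."""
    try:
        # Not available on every platform; respects CPU affinity masks (e.g. taskset)
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # os.cpu_count() can return None if the count is undetermined
        return os.cpu_count() or 1

def pick_workers(memory_heavy=False):
    n = usable_cpus()
    # Assumption: memory-heavy tasks start at half the CPUs; tune after profiling
    return max(1, n // 2) if memory_heavy else n

if __name__ == "__main__":
    print(pick_workers(), pick_workers(memory_heavy=True))
```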
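When to use multiprocessing.Pool instead: ProcessPoolExecutor is built on top of multiprocessing and is almost always the better API. It accepts initializer and initargs for per-worker setup (Python 3.7+) and max_tasks_per_child to recycle workers and contain memory leaks in long-running pools (Python 3.11+). Use multiprocessing.Pool directly only on older Python versions that lack those parameters, or when you need Pool-specific methods such as imap_unordered.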
Exception propagation: Exceptions in worker processes are re-raised in the parent when you call future.result(). With executor.map, an exception surfaces when iteration reaches the failing item, and iteration stops there. Always wrap .result() calls in try/except.
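A sketch of per-task exception handling under those rules. The names risky and collect are hypothetical, and catching Exception broadly is a simplification. In real code you would likely catch narrower types.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def risky(x):
    if x < 0:
        raise ValueError(f"negative input: {x}")
    return x * 2

def collect(executor, values):
    """Run risky() on each value; return (results, errors) keyed by input."""
    results, errors = {}, {}
    futures = {executor.submit(risky, v): v for v in values}
    for future in as_completed(futures):
        value = futures[future]
        try:
            results[value] = future.result()  # re-raises the worker's exception
        except Exception as exc:
            errors[value] = str(exc)
    return results, errors

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(collect(pool, [1, -1, 2]))
```

One failing task does not discard the results of the others, which is the main reason to prefer submit over map when failures are expected.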
asyncio
Asyncio is for concurrency, not parallelism. One thread, one event loop, thousands of concurrent I/O operations.
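To see the "one thread, many waits" model without any third-party dependency, here is a sketch that uses asyncio.sleep as a stand-in for awaiting a socket. fake_request and run_many are hypothetical names, and the 0.1-second delay is arbitrary.

```python
import asyncio
import time

async def fake_request(i, delay=0.1):
    # asyncio.sleep yields control to the event loop, like awaiting a socket
    await asyncio.sleep(delay)
    return i

async def run_many(n):
    start = time.perf_counter()
    # gather returns results in the order the awaitables were passed in
    results = await asyncio.gather(*(fake_request(i) for i in range(n)))
    return results, time.perf_counter() - start

if __name__ == "__main__":
    results, elapsed = asyncio.run(run_many(1000))
    print(f"{len(results)} waits finished in {elapsed:.2f}s")
```

All the waits overlap on a single thread, so the total time should stay close to one delay rather than growing with the number of tasks.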

    import asyncio
    import aiohttp

    async def fetch(session, url):
        async with session.get(url) as resp:
            return await resp.text()

    async def main():
        urls = ["https://example.com/api/1", "https://example.com/api/2"]
        async with aiohttp.ClientSession() as session:
            results = await asyncio.gather(*(fetch(session, u) for u in urls))
        return results

    asyncio.run(main())

CPU-bound work in async code: Use loop.run_in_executor to offload to a ProcessPoolExecutor:

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def cpu_work(n):
        return sum(i * i for i in range(n))

    async def main():
        loop = asyncio.get_running_loop()  # preferred over get_event_loop() inside a coroutine
        with ProcessPoolExecutor() as pool:
            result = await loop.run_in_executor(pool, cpu_work, 10_000_000)
        print(result)

    # The guard is required wherever workers are spawned (Windows, macOS)
    if __name__ == "__main__":
        asyncio.run(main())

Common mistake: Running CPU-bound work directly in an async def function. This blocks the event loop and starves all other coroutines. As a rough rule, if a function takes more than ~1ms of CPU time, offload it.
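The starvation effect can be observed with a small experiment. This is a sketch under stated assumptions: time.sleep stands in for any call that blocks the thread, the default thread-pool executor (None) is used only to keep the example light, and all function names are hypothetical. For genuinely CPU-bound work, pass a ProcessPoolExecutor instead.

```python
import asyncio
import time

def blocking_call(seconds):
    # Stand-in for work that holds the thread without yielding to the loop
    time.sleep(seconds)

async def count_ticks(stop):
    # Counts how often the event loop gets a chance to run this coroutine
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks

async def measure(offload):
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # let the ticker start
    if offload:
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, blocking_call, 0.2)
    else:
        blocking_call(0.2)  # blocks the event loop; the ticker cannot run
    stop.set()
    return await ticker

if __name__ == "__main__":
    print("ticks while blocked:  ", asyncio.run(measure(offload=False)))
    print("ticks while offloaded:", asyncio.run(measure(offload=True)))
```

When the call runs directly in the coroutine, the ticker gets almost no turns. When it is offloaded, the ticker keeps running throughout.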
multiprocessing Shared Memory
Passing large arrays between processes via pickle serialization is slow and doubles memory usage. multiprocessing.shared_memory lets processes share a memory buffer directly.

    import numpy as np
    from multiprocessing import shared_memory, Process

    def worker(shm_name, shape, dtype):
        # Attach to the existing block by name; no data is copied
        existing = shared_memory.SharedMemory(name=shm_name)
        arr = np.ndarray(shape, dtype=dtype, buffer=existing.buf)
        arr *= 2  # modify in place; visible to all processes
        existing.close()

    # The guard is required wherever workers are spawned (Windows, macOS)
    if __name__ == "__main__":
        # Create shared memory and put data in it
        data = np.arange(1_000_000, dtype=np.float64)
        shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
        shared_arr = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
        shared_arr[:] = data  # copy data into shared buffer

        p = Process(target=worker, args=(shm.name, data.shape, data.dtype))
        p.start()
        p.join()

        print(shared_arr[:5])  # [0. 2. 4. 6. 8.]

        shm.close()
        shm.unlink()  # free the shared memory block

Always call shm.close() in every process and shm.unlink() in exactly one process. On POSIX systems, a leaked shared memory block can persist until reboot.
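One way to make that cleanup rule hard to forget is try/finally. The sketch below uses only the standard library and attaches a second handle within the same process purely for illustration. In real use the second handle would live in a worker process. roundtrip is a hypothetical helper name.

```python
from multiprocessing import shared_memory

def roundtrip(payload: bytes) -> bytes:
    """Write payload to a shared block, read it back via a second handle."""
    n = len(payload)
    owner = shared_memory.SharedMemory(create=True, size=n)
    try:
        owner.buf[:n] = payload
        # A second handle attached by name, as a worker process would do
        view = shared_memory.SharedMemory(name=owner.name)
        try:
            return bytes(view.buf[:n])  # copy out before closing
        finally:
            view.close()    # every handle is closed
    finally:
        owner.close()
        owner.unlink()      # exactly one unlink, by the creator

if __name__ == "__main__":
    print(roundtrip(b"hello"))  # b'hello'
```

The data is copied out with bytes(...) before close() because reading through a buffer after its handle is closed is not safe.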
joblib
The scientific Python standard for parallelism. Used internally by scikit-learn.
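
    from joblib import Parallel, delayed

    def expensive_function(x):  # placeholder for your real work
        return x * x

    items = range(10)  # placeholder input

    # n_jobs=-1 uses all available CPUs
    results = Parallel(n_jobs=-1)(
        delayed(expensive_function)(x) for x in items
    )

Backend selection: joblib defaults to loky (a robust process pool). For numpy-heavy code that releases the GIL, the threading backend avoids process-spawn overhead: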

    results = Parallel(n_jobs=-1, backend="threading")(
        delayed(numpy_heavy_fn)(x) for x in items
    )

Memoization: joblib.Memory caches function results to disk, keyed by input arguments. Useful for expensive preprocessing that shouldn't repeat across runs:

    from joblib import Memory

    memory = Memory("/tmp/joblib_cache", verbose=0)

    @memory.cache
    def slow_transform(data):
        # Stand-in for the expensive computation; the return value is cached on disk
        result = [x * x for x in data]
        return result

When to use joblib vs ProcessPoolExecutor: Joblib handles numpy arrays efficiently (it memory-maps large arrays for worker processes instead of pickling them), automatically tunes batch sizes, and integrates with scikit-learn's n_jobs parameter. Use it for scientific Python. Use ProcessPoolExecutor for general-purpose Python without scientific dependencies.
Numba
JIT compilation for numerical Python. Numba compiles Python functions to machine code at first call, producing performance close to C.

    import numba
    import numpy as np

    @numba.jit(nopython=True)  # equivalent to @numba.njit
    def fast_sum_of_squares(arr):
        total = 0.0
        for x in arr:
            total += x * x
        return total

    data = np.random.randn(10_000_000)
    result = fast_sum_of_squares(data)  # first call compiles; subsequent calls are fast

numba.prange for parallel loops: With parallel=True, Numba can parallelize loops across cores:

    @numba.jit(nopython=True, parallel=True)
    def parallel_sum_of_squares(arr):
        total = 0.0
        # Numba treats "total +=" inside prange as a parallel reduction
        for i in numba.prange(len(arr)):
            total += arr[i] * arr[i]
        return total

When Numba beats numpy: Numba wins when you have fused loops (multiple operations per element that numpy would need multiple passes for), in-place mutations, or complex branching that numpy can't vectorize. If your code is a single numpy call (like np.sum(arr**2)), numpy is already fast and Numba adds compilation overhead for no gain.
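Limitations: Numba supports a subset of Python and numpy. In nopython mode, ordinary dicts and classes are not supported (numba.typed.Dict and numba.experimental.jitclass offer limited, typed alternatives), string support is limited, and there is no general file I/O. It's a tool for numerical inner loops, not general Python.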
When to Stop Trying in Python
Interpreted Python's overhead per operation is often roughly 10-100x that of C++. For some workloads, no amount of multiprocessing or Numba will close the gap.
Consider moving to C++ with Python bindings when:
- You need fine-grained parallelism on data structures (not just arrays)
- Latency per operation matters (sub-microsecond)
- The inner loop has complex branching, pointer chasing, or custom data structures
- You need SIMD control beyond what Numba provides
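The path: write the hot path in C++, expose it via pybind11, call from Python. This compiled-core pattern is how numpy, scipy, PyTorch, and most performance-critical Python libraries work, although not all of them use pybind11 specifically (numpy's core, for example, is written in C against the CPython C API).
For GPU workloads: CuPy gives you numpy-like syntax on CUDA. For custom GPU kernels, use CUDA directly or Numba's CUDA target (the cuda.jit decorator from numba.cuda).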