Multi-processing techniques in Python

Has your multi-threaded application grown GILs? Take a look at these packages for easy-to-use process management and inter-process communication tools.

There is no predefined theme for this column, so I plan to cover a different, likely unrelated, subject every month. The topics will range anywhere from open source packages in the Python Package Index (formerly The Cheese Shop, now PyPI) to new developments from around the Python community, and anything that looks interesting in between. If there is something you would like me to cover, send a note with the details to doug dot hellmann at pythonmagazine dot com, or add the link to your del.icio.us account with the tag “pymagdifferent”.

I will make one stipulation for my own sake: any open source libraries must be registered with PyPI and configured so that I can install them with distutils. Creating a login at PyPI and registering your project is easy, and only takes a few minutes. Go on, you know you want to.

Scaling Python: Threads vs. Processes

In the ongoing discussion of performance and scaling issues with Python, one persistent theme is the Global Interpreter Lock (GIL). While the GIL has the advantage of simplifying the implementation of CPython internals and extension modules, it prevents users from achieving true multi-threaded parallelism by ensuring that only one thread at a time executes Python bytecode, so a multi-threaded program effectively runs on a single processor. Threads that block on I/O, or that call into extension modules written in another language, can release the GIL and let other threads take over, of course. But if my application is written entirely in Python, only a limited number of bytecodes will be executed before one thread is suspended and another is started.
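
To see the effect, here is a quick, unscientific sketch that times a pure-Python busy loop run twice sequentially and then once in each of two threads; count and the loop size are invented for illustration, but on a multi-core machine the threaded version is typically no faster:

import threading
import time

def count(n):
    # A pure-Python busy loop; it holds the GIL while it runs
    while n > 0:
        n -= 1

N = 10000000

# Do the work twice, sequentially, in the main thread
start = time.time()
count(N)
count(N)
print 'sequential:', time.time() - start

# Do the same amount of work in two threads
start = time.time()
threads = [threading.Thread(target=count, args=(N,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print 'two threads:', time.time() - start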

Eliminating the GIL has been on the wish lists of many Python developers for a long time – I have been working with Python since 1998 and it was a hotly debated topic even then. Around that time, Greg Stein produced a set of patches for Python 1.5 that eliminated the GIL entirely, replacing it with a whole set of individual locks for the mutable data structures (dictionaries, lists, etc.) that had been protected by the GIL. The result was an interpreter that ran at roughly half the normal speed, a side-effect of all the extra work needed to acquire and release those finer-grained locks.

The GIL issue is unique to the C implementation of the interpreter. The Java implementation of Python, Jython, supports true threading by taking advantage of the underlying JVM. The IronPython port, running on Microsoft’s CLR, also supports true multi-threading. On the other hand, those platforms are always playing catch-up with new language or library features, so if you’re hot to use the latest and greatest, like I am, the C reference implementation is still your best option.

Dropping the GIL from the C implementation remains a low priority for a variety of reasons. The scope of the changes involved is beyond the level of anything the current developers are interested in tackling. Recently, Guido has said he would entertain patches contributed by the Python community to remove the GIL, as long as performance of single-threaded applications was not adversely affected. As far as I know, no one has announced any plans to do so.

Even though there is a FAQ entry on the subject as part of the standard documentation set for Python, from time to time a request pops up on comp.lang.python or one of the Python-related mailing lists to rewrite the interpreter so the lock can be removed. Each time it happens, the answer is clear: use processes instead of threads.

That response does have some merit. Extension modules become more complicated without the safety of the GIL. Processes typically have fewer inherent deadlocking issues than threads. They can be distributed between the CPUs on a host, and even more importantly, an application that uses multiple processes is not limited by the size of a single server, as a multi-threaded application would be.

Since the GIL is still present in Python 3.0, it seems unlikely that it will be removed from a future version any time soon. This may disappoint some people, but it is not the end of the world. There are, after all, strategies for working with multiple processes to scale large applications. I’m not talking about the well worn, established techniques from the last millennium that use a different collection of tools on every platform, nor the time-consuming and error-prone practices that lead to solving the same problem time and again. Techniques using low-level, operating system-specific, libraries for process management are as passé as using compiled languages for CGI programming. I don’t have time for this low-level stuff any more, and neither do you. Let’s look at some modern alternatives.

The subprocess module

Version 2.4 of Python introduced the subprocess module and finally unified the disparate process management interfaces available in other standard library packages to provide cross-platform support for creating new processes. While subprocess solved some of my process creation problems, it still primarily relies on pipes for inter-process communication. Pipes are workable, but fairly low-level as far as communication channels go, and using them for two-way message passing while avoiding I/O deadlocks can be tricky (don’t forget to flush()). Passing data through pipes is definitely not as transparent to the application developer as sharing objects natively between threads. And pipes don’t help when the processes need to scale beyond a single server.
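
For example, a simple round trip through a child process looks something like this sketch; the child command is just a stand-in that echoes one line back:

import subprocess

# Start a child that reads one line from stdin and echoes it back,
# with pipes attached to its standard input and output.
proc = subprocess.Popen(
    ['python', '-c', 'import sys; sys.stdout.write(sys.stdin.readline())'],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    )

# communicate() writes the input, closes the pipe, and collects the
# output, avoiding the deadlocks that come from juggling both pipes
# by hand.
output, _ = proc.communicate('ping\n')
print 'child said:', output.strip()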

Parallel Python

Vitalii Vanovschi’s Parallel Python package (pp) is a more complete distributed processing package that takes a centralized approach. Jobs are managed from a “job server”, and pushed out to individual processing “nodes”.

Those worker nodes are separate processes, and can be running on the same server or other servers accessible over the network. And when I say that pp pushes jobs out to the processing nodes, I mean just that – the code and data are both distributed from the central server to the remote worker node when the job starts. I don’t even have to install my application code on each machine that will run the jobs.

Here’s an example, taken right from the Parallel Python Quick Start guide:

import pp
job_server = pp.Server()
# Start tasks
f1 = job_server.submit(func1, args1, depfuncs1, modules1)
f2 = job_server.submit(func1, args2, depfuncs1, modules1)
f3 = job_server.submit(func2, args3, depfuncs2, modules2)
# Retrieve the results
r1 = f1()
r2 = f2()
r3 = f3()

When the pp worker starts, it detects the number of CPUs in the system and starts one process per CPU automatically, allowing me to take full advantage of the computing resources available. Jobs are started asynchronously, and run in parallel on an available node. The callable object returned when the job is submitted blocks until the response is ready, so response sets can be computed asynchronously, then merged synchronously. Load distribution is transparent, making pp excellent for clustered environments.
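
Here is a slightly more concrete sketch along the same lines, submitting a real (if contrived) function and collecting the results; sum_range and the chunk sizes are invented for illustration:

import pp

def sum_range(start, end):
    # Add up the integers in [start, end)
    return sum(xrange(start, end))

# The server detects the number of CPUs and starts one worker per CPU
job_server = pp.Server()
print 'Workers available:', job_server.get_ncpus()

# Submit one job per chunk of the full range
jobs = [ job_server.submit(sum_range, (i * 1000000, (i + 1) * 1000000))
         for i in range(4) ]

# Calling the object returned by submit() blocks until that job is done
print 'Total:', sum(job() for job in jobs)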

One drawback to using pp is that I have to do a little more work up front to identify the functions and modules on which each job depends, so all of the code can be sent to the processing node. That’s easy (or at least straightforward) when all of the jobs are identical, or use a consistent set of libraries. If I don’t know everything about the job in advance, though, I’m stuck. It would be nice if pp could automatically detect dependencies at runtime. Maybe it will, in a future version.

The processing Package

Parallel Python is impressive, but it is not the only option for managing parallel jobs. The processing package from Richard Oudkerk aims to solve the issues of creating and communicating with multiple processes in a portable, Pythonic way. Whereas Parallel Python is designed around a “push” style distribution model, the processing package is set up to make it easy to create producer/consumer style systems where worker processes pull jobs from a queue.
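
A minimal sketch of that pull model, using the same Process and Queue classes that appear in the listings below, might look like this; the worker count and the square-the-number jobs are invented for illustration:

from processing import Process, Queue

def worker(jobs, results):
    # Pull jobs from the shared queue until the sentinel arrives
    while True:
        item = jobs.get()
        if item is None:
            break
        results.put(item * item)

if __name__ == '__main__':
    jobs = Queue()
    results = Queue()

    # Start a small pool of consumer processes
    pool = [ Process(target=worker, args=(jobs, results)) for i in range(3) ]
    for p in pool:
        p.start()

    # Produce some work, then one sentinel per worker
    for i in range(10):
        jobs.put(i)
    for p in pool:
        jobs.put(None)

    # Collect the results, then wait for the workers to exit
    output = [ results.get() for i in range(10) ]
    for p in pool:
        p.join()

    print 'Results:', sorted(output)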

The package hides most of the details of selecting an appropriate communication technique for the platform by choosing reasonable default behaviors at runtime. The API does include a way to explicitly select the communication mechanism, in case I need that level of control to meet specific performance or compatibility requirements. As a result, I end up with the best of both worlds: usable default settings that I can tweak later to improve performance.

To make life even easier, the processing.Process class was purposely designed to match the threading.Thread class API. Since the processing package is almost a drop-in replacement for the standard library’s threading module, many of my existing multi-threaded applications can be converted to use processes simply by changing a few import statements. That’s the sort of upgrade path I like.
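
In the simplest cases the conversion really is a one-line change to the import; here is a sketch of the before and after, with the Worker alias used purely for illustration:

# Before: run the work in a thread
#from threading import Thread as Worker

# After: run the same work in a separate process
from processing import Process as Worker

def work():
    print 'working'

if __name__ == '__main__':
    w = Worker(target=work)
    w.start()
    w.join()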

Listing 1 contains a simple example, based on the examples found in the processing documentation, which passes a string value between processes as an argument to the Process instance and shows the similarity between processing and threading. How much easier could it be?

Listing 1

#!/usr/bin/env python
# Simple processing example

import os
from processing import Process, currentProcess

def f(name):
    print 'Hello,', name, currentProcess()

if __name__ == '__main__':
    print 'Parent process:', currentProcess()
    p = Process(target=f, args=[os.environ.get('USER', 'Unknown user')])
    p.start()
    p.join()

In a few cases, I’ll have more work to do to convert existing code that was sharing objects which cannot easily be passed from one process to another (file or database handles, etc.). Occasionally, a performance-sensitive application needs more control over the communication channel. In these situations, I might still have to get my hands dirty with the lower-level APIs in the processing.connection module. When that time comes, they are all exposed and ready to be used directly.
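
As a taste of that lower level, here is a sketch that passes one message over a raw connection pair created with Pipe(); the endpoints are the connection objects defined in processing.connection, and the little echo protocol is invented for illustration:

from processing import Process, Pipe

def child(conn):
    # Read a request from the raw connection and send back a reply
    msg = conn.recv()
    conn.send('echo: ' + msg)
    conn.close()

if __name__ == '__main__':
    parent_end, child_end = Pipe()
    p = Process(target=child, args=(child_end,))
    p.start()
    parent_end.send('hello')
    print 'Child replied:', parent_end.recv()
    p.join()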

Sharing State and Passing Data

For basic state handling, the processing package lets me share data between processes by using shared objects, similar to the way I might with threads. There are two types of “managers” for passing objects between processes. The LocalManager uses shared memory, but the types of objects that can be shared are limited by a low-level interface which constrains the data types and sizes. LocalManager is interesting, but it’s not what has me excited. The SyncManager is the real story.

SyncManager implements tools for synchronizing inter-process communication in the style of threaded programming. Locks, semaphores, condition variables, and events are all there. Special implementations of Queue, dict, and list that can be used between processes safely are included as well (Listing 2). Since I’m already comfortable with these APIs, there is almost no learning curve for converting to the versions provided by the processing module.

Listing 2

#!/usr/bin/env python
# Pass an object through a queue to another process.

from processing import Process, Queue, currentProcess

class Example:
    def __init__(self, name):
        self.name = name
    def __str__(self):
        return '%s (%s)' % (self.name, currentProcess())


def f(q):
    print 'In child:', q.get()


if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=[q])
    p.start()
    o = Example('tester')
    print 'In parent:', o
    q.put(o)
    p.join()

For basic state sharing with SyncManager, using a Namespace is about as simple as I could hope. A namespace can hold arbitrary attributes, and any attribute attached to a namespace instance is available in all client processes which have a proxy for that namespace. That’s extremely useful for sharing status information, especially since I don’t have to decide up front what information to share or how big the values can be. Any process can change existing values or add new values to the namespace, as illustrated in Listing 3. Changes to the contents of the namespace are reflected in the other processes the next time the values are accessed.

Listing 3

#!/usr/bin/env python
# Using a shared namespace.

import processing

def f(ns):
    print ns
    ns.old_coords = (ns.x, ns.y)
    ns.x += 10
    ns.y += 10

if __name__ == '__main__':
    # Initialize the namespace
    manager = processing.Manager()
    ns = manager.Namespace()
    ns.x = 10
    ns.y = 20

    # Use the namespace in another process
    p = processing.Process(target=f, args=(ns,))
    p.start()
    p.join()

    # Show the resulting changes in this process
    print ns

Remote Servers

Configuring a SyncManager to listen on a network socket gives me even more interesting options. I can start processes on separate hosts, and they can share data using all of the same high-level mechanisms described above. Once they are connected, there is no difference in the way the client programs use the shared resources remotely or locally.

The objects are passed between client and server using pickles, which introduces a security hole: because unpacking a pickle may cause code to be executed, it is risky to trust pickles from an unknown source. To mitigate this risk, all communication in the processing package can be secured with digest authentication using the hmac module from the standard library. Callers can pass authentication keys to the manager explicitly, but default values are generated if no key is given. Once the connection is established, the authentication and digest calculation are handled transparently for me.
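
To get a feel for the mechanics, here is a sketch using the Listener and Client classes from processing.connection with an explicit authentication key. The address and key are placeholders, and I am going from memory of the documentation for the exact spellings, so check the details against your release:

from processing import Process
from processing.connection import Listener, Client

ADDRESS = ('localhost', 6000)     # placeholder host and port
AUTHKEY = 'secret password'       # shared key used for the digest handshake

def client():
    # The client presents the same key; the hmac challenge is handled
    # transparently during the connection setup
    conn = Client(ADDRESS, authkey=AUTHKEY)
    print 'Client received:', conn.recv()
    conn.close()

if __name__ == '__main__':
    listener = Listener(ADDRESS, authkey=AUTHKEY)
    p = Process(target=client)
    p.start()
    conn = listener.accept()      # callers with the wrong key are rejected
    conn.send('hello over the network')
    conn.close()
    p.join()
    listener.close()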

Conclusion

The GIL is a fact of life for Python programmers, and we need to consider it along with all of the other factors that go into planning large scale programs. Both the processing package and Parallel Python tackle the issues of multi-processing in Python head on, from different directions. Where the processing package tries to fit itself into existing threading designs, pp uses a more explicit distributed job model. Each approach has benefits and drawbacks, and neither is suitable for every situation. Both, however, save you a lot of time over the alternative of writing everything yourself with low-level libraries. What an age to be alive!