Custom Search

Friday, November 15, 2013

Threads vs. processes for parallel processing

Introduction

Concurrency is a hot topic these days, largely because multi-core systems are the norm rather than the exception, and have been for a while. The thing is, multi-core isn't as much of a win for concurrent operations as it is for parallel processing. While the difference is subtle, it's also crucial for deciding what tools you want to use.

Some definitions

So, what is concurrent operations, parallel processing, and the tools I'm talking about? Or more importantly, given that they are used in a number of different ways in the industry, how am I using them?

Parallel processing

Parallel processing means carrying out multiple computations at the same time. Waiting for external devices is not carrying out a computation, so doing computing while IO is going on isn't as parallel processing - at least not in this document.

Concurrent operations

Concurrent operations means doing multiple things at the same time. Unlike parallel processing, concurrent operations includes doing IO. In fact, doing nothing but IO can qualify. Parallel processing is a form of concurrent operations, but I'm going to exclude parallel processing from the definition of concurrent operations in this document as a matter of convenience.

Thread of execution

A thread of execution is a series of instructions that are to be executed by the CPU in order. The operating system may interrupt the thread and allow other threads of execution to operate between instructions in a thread. This really isn't a term used in the industry, but I wanted a term that covered both threads and processes. I'm going to use it to avoid saying threads and processes or thread or process all the time.

Process

A process is a thread of execution plus an address space. The address space is not shared with other processes. Unix was a revolutionary step forward for processes. Before Unix, processed tended to be expensive and painful to create. Users usually got one process, and running a program would involve the command processor (which was running in the users process) loading the executable into the users address space and then calling the executable as a subroutine. It was possible to do concurrent operations in a process, and even have multiple threads of execution active, but it involved executable images specifically built for the purpose, since loading a standard executable would destroy the one that was running. By making processes easy and cheap to create from standard binaries, Unix allowed the command processor to run in it's own process, meaning that concurrent operations could be done by simply having the command processor not wait for a program to finish before prompting the user for the next command.

Thread

A thread is a popular - and very effective - tool for doing concurrent operations. A thread is just a therad of execution that shares its address space with other threads. Threads work very well for concurrent IO, since you can just code the IO normally, and then sync with the main process when it's done. For parallel processing, not so well. I'm going to discuss those problems next.

The problems with threads

Using threads creates a couple of problems that using processes avoids.

Shared everything

Actually, there's only one problem with threads, but it creates a number of issues. The problem is that all your data is shared implicitly. The immediate affect of this is that you have to figure out which of your data objects might be modified in different threads, and do something to make sure that doesn't happen concurrently.

In my experience, the biggest source of concurrency problems in threaded programs is programmers who didn't realize that some object they were modifying could be modified by another thread. I have seen the claim - from people I trust and respect - that a properly designed programming language (meaning not C++ or Java) and programming practices can avoid these problems. I haven't used that language, and find it hard to believe that masses of programmers will all follow proper practices. Maybe if a single entity controlled the language and everyone learned from their examples - which may well be the case for this language.

Elsewhere, functional programming languages that make most data immutable are catching on as a solution, allowing either the language runtime or type system to catch such errors.

Once you've identified those places, you've got to arrange to provide the appropriate interlocks. Language constructs that do this automatically are another hot research topic.

Fragility

If your program does something completely unexpected, then you may well have no idea what state it's in. The question becomes - how do you deal with this situation? If you are using threads for concurrency, then you have to worry about what the rogue thread may have done to data being processed by another thread, even if it they were logically unconnected - one of the consequences of implicitly sharing everything. To be reasonably sure that everything that might have been corrupted is discarded, you have to discard all the threads in the process.

If you using processes for concurrency, discarding that process provides the same level of assurance, but that's only one execution threads worth of work.

So using threads makes your programs fragile.

No distributed processing

If you're dealing with any significant amount of data, you're going to want to have multiple machines working on it. If your application is built using processes for concurrent operations and sockets for data sharing, this is trivial. All you have to do is add some kind of rendezvous mechanism.

If you're using threads, on the other hand - well, you pretty much have to start over to get any communications between the execution threads at all.

Sharing data between processes

Given that processes don't share an address space, the obvious issue is how do you get data between them. I've actually had proponents of threads claim that you couldn't share data between processes without serializing it (though to be fair, they were probably trolling). While that is one solution - and pretty much required for large problems - there are alternatives, each with their own downside.

The important thing to realize is that any non-trivial program will have multiple types of information to be shared, each of which will have it's own requirements. Using threads provides a single solution you have to use, even if it doesn't fit the requirements for the data at hand. If you're used to that model, you'll need to get past it and think about how you're actually using your data.

Explicit sharing

You can always share memory explicitly between processes. This provides the same level of communications as threads, without the nasty implicit sharing of everything. The downside is that I know of no language that provides language-level support for insuring that everything you need to share winds up in the shared memory segment. This means you are in practice limited to machine primitives composed with arrays and records. In my experience this hasn't been a problem, as most of the cases where you need this level of communications use such data.

Serial communications

You can also serialize the data and write it outside of your process. This is required if you want to distribute the processing of your data to more than one machine. The downside is that - well, compared to shared memory, it's expensive. You have to serialize and deserialize the data, and copy it between the processes in serialized form. If you have lots of communications going on (not the same thing as lots of data - passing one byte back and forth a million times is as bad as passing a megabyte back and forth once), this won't work very well. Unless, of course, you more data than will fit in the address space, in which case you pretty much have to do things this way.

Disk communications

An interesting variant on the idea of serial communications is storing the data on disk. This has the advantage of preventing data loss.

You store each chunk of data in a file in a queue directory. Each worker process scans the queue directory, trying to get an exclusive lock on each file until it succeeds in doing so. It processes the data therein, writing the output to a locked file in a second queue directory. When it's done, it can unlock the output file and then remove the input file. There's no chance of data loss, though you could wind up with duplicated data. Adding a work directory between the two queues reduces that problem, but then requires code to clean up the work directory at startup, moving input files back to the input queue and removing output files.

Inheritance

If you have a data structure small enough to fit in memory that all the processes need read access to, and any writes are not shared between processed, inheriting the data by creating it in the parent and forking to pass it to children works very well.

Configuration data for the application often fits this mold. Having all the configuration data handled by the parent - rather than having it in distinct files and the parent passing along the file name - means the parent can verify things between processes, like that the output of one process is correctly connected to the input of the next. Better yet, it can provide intelligent defaults for some values, and share configuration information between process, which has resulted in reducing configuration information by 99% in at least one case.

The downside is that the communications is one-directional. If you need to communicate a value back to the parent, you'll have to find another method to do so.

Exit status

This is another one-directional channel, and is very narrow - basically a single byte. While it doesn't seem very useful, it's used all the time in Unix systems. In every conditional statement in a shell script, the condition being tested is the exit status of a child process (unless it's a builtin, anyway). Any application that does process-based concurrency needs to deal with these, if for no other reason than to check that the child process actually finished properly.

A real-world example

I was involved in refactoring a credit card payment processing system from a single monolithic, threaded program into a distributed application.

Being distributed was part of the requirements, as one of the goals of the refactoring was to decrypt the financially sensitive data on machines that weren't exposed to the network. So serial communications was used to between hosts on the periphery of the network and the internal hosts with the crypto hardware.

The configuration information on each host was created by the parent process and inherited by the workers.

Each machine had a state table for all open connections to it. This was explicitly shared between the processes, so that the parent could correctly deal with child process failing, so that 1) no transaction was reported as completed until it was actually completed, and 2) no transaction was processed twice. Further, it was shared via a file mmaped to disk, so that these same guarantees could be made if the entire machine failed.

Miscibility

OK, so why not just mix and match threading and processes? Well, there's one last little issue with everything being implicitly shared. In order to fork safely, you need to know things about the state of all the locks - both external and internal - in your program. Getting that information becomes much more difficult if you have locks being used in multiple threads - and when do you use threads without some form of locking?

While not every case where a lock is held by a thread that isn't created in the child and hence won't be freed in the child is a disaster, it is in a lot of them. If an internal lock is locked by another thread when you fork and you need it, you're hosed because you'll never be able to get it. External locks you may or may not be able to get, even if you shouldn't be able to if the parent has them, and freeing them may or may not cause them to be freed for everyone, even if the parent is still using them.

Mixing threads and forks arbitrarily can be done, but - well, it's not at all clear it's less trouble than simply staying with threads for everything. On the other hand, if you limit each process to one or the other - parents never create threads but use forks to create workers, which use threads if they need them, but never fork - then you'll be OK.

Except, of course, if you're building something like a CI system, where what the workers do is fork off processes to build things. In that case, workers should avoid using threads.