When you create a system, you want it to run as quickly as possible. Most of the time, this isn’t a problem, but eventually you’ll find yourself in a situation where the performance isn’t as good as you’d like or need.
At this point you have a few options: at first you might decide to throw more CPU power and memory at the problem. This is usually the approach people mean when they tell you that “CPU and memory are cheap!”. If you do this, things will probably improve for a while, before you’re back in the same position again. If you’re using a cloud-based system, you’ll also have a bigger recurring cloud bill and less money in your bank account. So, as always, it’s best to ignore anyone who tells you how “cheap” CPU and memory are.
The first thing you should ACTUALLY do when your system isn’t running fast enough is to make sure you’re using efficient algorithms. As luck would have it, there’s an article about that on this site: Big O Notation.
Once you’ve improved the algorithms, assuming your system still isn’t running as quickly as you’d like, you might be thinking of going back to the idea of adding more CPU. Just don’t! We’ve covered that already – it’s a VERY last resort.
There are ways to make your code run faster (or sometimes slower) than you ever thought possible. If you want to see what I’m talking about, take a look at your system’s task manager or resource monitor while your software is running. You’ll probably see one of your CPU cores running saturated. Look closer, though, and there’s probably a lot of capacity available on the other cores. Using that processing capacity will let you run your software faster, on the same hardware, and all you have to do is change your code to use more than a fraction of your CPU!
The Basics
A Brief History Lesson
The first mainstream CPUs executed one instruction at a time, at a speed dictated by the system clock. That worked pretty well, and performance increases came from wider data buses, smaller transistors on the chip and rising clock frequencies. Towards the end of the 1990s, things looked like they might run out of steam, and a different approach became necessary: with clock frequencies heading into the GHz range, these CPUs were dissipating a couple of hundred watts in heat alone. Hyperthreading was introduced by Intel in the early 2000s and is still used in their processors. It’s a way of making every physical CPU core look like two “virtual” cores, which share the core’s resources to improve throughput. It’s a clever technology, but from the software engineering perspective you simply see more cores, which was great when most systems had single-core CPUs – suddenly there were two! Of course, there are limits to how much performance hyperthreading can add – improvements depend upon the type of software being run, with CPU-intensive tasks being the hardest to improve.
What really moved things along, though, was multiple CPUs in a server – physically in multiple sockets on the PCB – followed quickly by multiple CPU cores on a single die. This allowed clock frequencies to be dropped a little, which reduced the heat generation, and allowed multiple processes to run genuinely simultaneously.
Why Did You Imply The Frequency Causes Power Problems?
Because it does.
Why?
CPUs, and most other modern ICs, and indeed some pretty fancy audio amplifier transistors, use a technology called “Complementary Metal Oxide Semiconductor” or CMOS. This technology has some interesting features, the most important being that transistors built with it have very low resistance when they’re fully “on”, and very high resistance when they’re fully “off”. These are the only two states used in CPUs, so it sounds reasonable so far. The problem is that almost all of the energy, and therefore heat, is used when the transistors switch between “on” and “off”. There’s not a lot of energy involved, but it all adds up. For example, let’s do some (rough) calculations using the figures for a 2005 processor, the Intel Pentium 4 Extreme Edition:
Each time a transistor changes state, it uses a fraction of a femtojoule (10⁻¹⁵J, or 0.000000000000001J).
The Pentium 4 Extreme Edition contains around 200,000,000 transistors.
The clock runs at around 3.7GHz, or 3,700,000,000Hz, which was around the maximum at the time.
Multiplying everything up, we get:
10⁻¹⁵J x 200,000,000 x 3,700,000,000Hz
=> 740W
This is some pretty rough estimation, assuming every transistor changes state on every clock edge, but it shouldn’t be too much of a surprise to find that this particular chip was capable of dissipating 220W when things got busy on it. Considering the surface area is only 135mm², that’s a power density of over 1.6MW/m². This can cause a LOT of problems, and things obviously couldn’t continue like that. If you want even more details, I recommend reading Herb Sutter’s essay “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software”.
Anyway, Moving On...
Herb Sutter’s essay was originally published in 2004 – multi-core processors have been around for quite some time – but so much code is still single-threaded. Why is this? I’ve heard a lot of reasons over the years why developers avoid concurrency, including these:
- It’s too complicated
- It never works
- We don’t need it
- It’s not worth it
- You can’t debug it
- It’s too scary
- It won’t make any difference
- We’re only running on one CPU
- Our staff don’t have the skills
I might be a little judgmental, but none of these arguments ever struck me as particularly good excuses for not moving away from single-threaded processes and making use of the multiple cores on modern CPUs.
What’s Concurrency?
Concurrency is where multiple threads or processes appear to execute simultaneously. They don’t have to be executing at exactly the same moment – on a single-core CPU they might be run in very short bursts, with the CPU switching between threads at very short intervals. This technique is called time slicing. Concurrency also covers the situation where multiple threads or processes really are running simultaneously on multiple cores. As far as an observer is concerned, in both situations the threads are running at the same time.
This is a concept that takes a while for a lot of people to grasp, so it might help to use a few diagrams to illustrate it.
Let’s start with the “traditional” serial style of operations:
In theory, assuming we have plenty of processor cores, or at least more cores than we have processes, running x tasks in parallel will take time t / x :
However, there’s always some overhead in starting new processes or threads, so the time is actually somewhere between t / x and t :
In any case, even if the execution time is almost t, it’s still an improvement over single-threaded execution. This is even more true when you consider that modern processors often run individual cores at lower clock speeds than some older chips did, so there’s even more speed to gain by distributing the work across multiple streams.
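To put rough numbers on that: suppose a job takes t = 12 seconds single-threaded and you split it evenly across x = 4 cores. The ideal parallel time is 12 / 4 = 3 seconds; allow, say, half a second or so for starting the workers and combining their results, and you might actually see something like 3.5 to 4 seconds – still a big improvement over 12. (These figures are made up purely for illustration.)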
The Types Of Parallel Processing
You might have noticed I referred to multiple “streams” of work above. This is because there are three main types of parallel programming, and depending upon the circumstances any of them could be the best.
Multiprocessing
Multiprocessing is one of the original forms of concurrency, and involves having a completely separate process for each part of the system. This isn’t particularly unusual; if you have a database application with a web server on the same machine, they’re both running as separate processes and communicating with each other as required. This is a multiprocessing system. If you have the processes on separate servers, then it’s a distributed multiprocessing system. The principles are the same with both, but distributed systems involve more networking and a bigger hardware bill!
In general, multiprocessing tends to be used for “heavier duty” CPU-intensive processes, or processes that need to run for a long time. Applications like web servers, databases, business servers and microservices can all be implemented using multiprocessing. Like anything else there are pros and cons to running separate processes:
- Each process must be contained within its own executable file, and must have its own start-up and configuration logic.
- If you want to communicate between processes with more detail than “Process X is running/stopped”, then some form of inter-process communication has to be included. It might be named pipes, an API, files, or some other method.
- It generally takes a long time, in processing terms, to start a process. Its code must be loaded and an execution space must be set up before it starts. If you want to run something hundreds of times a second, separate processes might not be the best method to use.
- The memory and execution spaces of each process are totally isolated. If something goes wrong with one process, it cannot access the memory of any of the other processes, and it can’t crash them.
- Usually, but not always, the operating system will try to put each process on a separate CPU core, or group of cores.
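To make this concrete, here’s a minimal multiprocessing sketch in Python (the worker function, the numbers and the names are made up for illustration): each chunk of CPU-heavy work runs in its own process, and the results come back through a Queue – one of the inter-process communication options mentioned above.

```python
from multiprocessing import Process, Queue

def count_primes(limit, results):
    """CPU-heavy work: count the primes below `limit` and report back via the queue."""
    found = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            found += 1
    results.put((limit, found))

if __name__ == "__main__":
    results = Queue()
    limits = [50_000, 60_000, 70_000, 80_000]
    workers = [Process(target=count_primes, args=(lim, results)) for lim in limits]
    for w in workers:
        w.start()                     # each worker is a separate OS process with its own memory
    for _ in limits:
        limit, found = results.get()  # results arrive over the queue, not shared memory
        print(f"primes below {limit}: {found}")
    for w in workers:
        w.join()                      # wait for every process to finish cleanly
```

On a machine with enough free cores, the operating system will typically spread these four processes across different cores, so the four counts run side by side.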
Multithreading
Threads are a sort of “sub-process” – they’re contained within a process and they have access to the same resources as other threads within the same process. This means that communication between threads is easier than between processes – shared memory is probably the most straightforward method.
At the processor level, threads are scheduled by the operating system, which may keep a thread near its parent process for cache reasons but is free to spread threads across cores. In the .NET framework, managed threads are not the same thing as CPU threads, and may end up on different processor cores.
Separate threads are useful for CPU-intensive tasks which should finish before, or at the same time as, the parent process. Tasks that involve complex calculations are often spawned as separate threads. Even more commonly, Windows applications typically run the UI and the rest of the application on separate threads. This way you avoid the dreaded “Application Not Responding” heading on your application window whenever there’s any kind of delay caused by processing.
Communications between threads are simpler than between processes – shared memory and other, system-specific, structures are usually all you need.
All of the code for the threads is usually in the same executable as the main process. This makes it generally faster to start a thread than a process.
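Here’s a minimal threading sketch, again in Python and with made-up names: a background worker thread does some slow, I/O-style work while the main thread stays responsive, and the two communicate through nothing more than shared memory guarded by a lock.

```python
import threading
import time

progress = {"done": 0}                   # shared between the threads
progress_lock = threading.Lock()

def slow_download(chunks):
    """Pretend to fetch `chunks` pieces of data, updating the shared progress as we go."""
    for _ in range(chunks):
        time.sleep(0.2)                  # stand-in for a network or disk read
        with progress_lock:
            progress["done"] += 1

worker = threading.Thread(target=slow_download, args=(10,), daemon=True)
worker.start()

while worker.is_alive():                 # the main ("UI") thread stays free to report progress
    with progress_lock:
        print(f"downloaded {progress['done']}/10 chunks")
    time.sleep(0.5)
worker.join()
print("all chunks downloaded")
```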
Asynchronous Code
Some more modern languages such as C#, Python 3 and JavaScript have async/await-type constructs. These features allow you to specify a function as async, and to call it using await. This flags to the underlying compiler or interpreter that a result from the function might not be immediately available. This means that while the system is waiting for the result from one function, it can go and carry out other parts of the program.
Usually asynchronous code only works if the code is “asynchronous all the way down”, i.e. if the operating system functions that are called at the lowest level are asynchronous themselves. This is usually the case for I/O functions such as file or network operations, and not usually the case for CPU-intensive operations.
This means that asynchronous code is great if you’re working with a system that does a lot of I/O.
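As a sketch of what that looks like, here’s a small asyncio example in Python (the function names and delays are invented): while one “request” is waiting, the event loop gets on with the others, so the total time is roughly that of the slowest call rather than the sum of all of them.

```python
import asyncio

async def fetch(name, delay):
    """Stand-in for an I/O call such as a network request or database query."""
    await asyncio.sleep(delay)      # the event loop runs other tasks during this wait
    return f"{name}: done after {delay}s"

async def main():
    tasks = [fetch("users", 1.0), fetch("orders", 1.5), fetch("invoices", 0.5)]
    for result in await asyncio.gather(*tasks):
        print(result)               # total time is roughly 1.5s, not 3s

asyncio.run(main())
```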
Some Other Points
In both multiprocessing and multi-threaded systems, the operating system takes care of which CPU and/or core to run the code on. Some operating systems let you provide “hints” as to where this should be, but generally it’s not worth the trouble. The operating system keeps track of the more and less busy processor areas, so it will usually do a good job of maximizing the CPU usage.
What’s The Best Language To Use?
Most of the more popular languages are good choices for some form of concurrency. Languages like C and C++ were created before multiple CPUs and multiple cores were in general use, but C++ now has its own multi-threading classes, including the std::thread class. It’s come quite a long way since I had to implement it myself in Windows with classes and a virtual thread execution method!
C# is also a good choice, allowing all three techniques described here to be easily used.
Python is a little less useful. It handles multiprocessing well, because each process gets its own interpreter, so the operating system can spread the processes across cores. Threads, on the other hand, are restricted by the interpreter’s GIL (Global Interpreter Lock), which stops simultaneous access to Python objects. There are plans to remove the GIL from the standard Python interpreter, although that involves making the interpreter itself thread-safe. This means that you should only bother with Python threads, or asynchronous code, if you’re doing a lot of I/O. For CPU-intensive tasks, use multiprocessing, because Python threads won’t help you at all.
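If you want to see the GIL’s effect for yourself, here’s a rough timing sketch (the workload and worker counts are arbitrary, and the timings will vary from machine to machine): the same CPU-bound function is run through a thread pool and then a process pool. On a multi-core machine you’d expect the thread pool to take roughly the single-threaded time, and the process pool to be several times faster.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn_cpu(n):
    """Pure-Python busy work that never releases the GIL for long."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, label):
    """Run four copies of the same work through the given executor and time it."""
    with executor_cls(max_workers=4) as pool:
        start = time.perf_counter()
        list(pool.map(burn_cpu, [2_000_000] * 4))
        print(f"{label}: {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":
    timed(ThreadPoolExecutor, "threads (GIL-bound)")   # roughly serial speed
    timed(ProcessPoolExecutor, "processes")            # can use four separate cores
```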
Implementing Concurrent Software
You might get to this point and decide that the best thing to do is to redesign your application to spawn as many threads or processes as it needs. An app controlling 64 communications links? 64 threads sounds good. Processing data as it arrives at your application? Why not spawn one thread or process per set?
You might even implement your system like this, set it running, and watch as its performance heads swiftly down the drain, maybe even with the sort of clogging up you really wouldn’t want to see in your drains.
It’s All Perfectly Logical – Why Did It All Go Wrong?
The problem you’re facing at this point is that there are limited resources in the system. If you have, say, eight cores on your machine you can assume that maybe two are occupied by the operating system. That leaves you six cores, and in the example above you started 64 threads. Even without all the other processes running on the system, you’re now trying to run more than ten threads per core. The operating system will usually be time slicing them, but each context switch also takes time as registers and data are swapped. In addition, any inter-thread or inter-process communications and thread safety tasks are run more often.
All of this means that if you spawn enough threads or processes, your system will slow dramatically as the time taken for the CPU to carry out administrative tasks approaches the time it has to actually run your code.
What’s The Solution?
As tempting as it is to run lots of separate threads and processes, don’t. Limit their number to what you need, rather than what seems good. Divide the tasks your system carries out into discrete jobs, and then run them a few at a time. It sounds counter-intuitive, but it works.
Most languages have built-in or downloadable schedulers – thread pools, task queues and the like – that you can use. They will allow you to add jobs to a queue, and will look after scheduling them. You can see schedulers like this in action when you use a database server – if you throw thousands of requests at it at once, the server will just queue them and run them when it can.
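As a sketch of that idea, here’s a small Python example that uses a process pool as the “scheduler” (the job function and the numbers are made up): five hundred jobs go into the queue, but only four run at any one time.

```python
from concurrent.futures import ProcessPoolExecutor

def handle_job(job_id):
    """Stand-in for one discrete piece of work."""
    return job_id * job_id

if __name__ == "__main__":
    # Four workers service a queue of 500 jobs; the pool looks after the scheduling.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(handle_job, range(500)))
    print(f"processed {len(results)} jobs with only 4 workers")
```

The important design choice is the bounded worker count: the queue can grow as large as it likes, but the number of things actually competing for CPU time stays small.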
Summary
Concurrency, in any of its forms, is usually a great way to improve the performance of your systems. It’s not always simple – this article doesn’t even touch on the extra pieces of code you’ll usually need like mutexes, semaphores and locks – but even with those, it’s worth the extra effort.