JOIN Introduction to OpenMP, Part 2

By d000hg
TopCoder Member

In Part 1, we looked at an overview of multi-threading, introduced OpenMP, and began to review its directives. Part 2 continues with a look at the reduction directive.

Reduction: Let's imagine we want to sum the elements in an array. We might have some code which looks like this:
```int sum=0,i;
float array; //imagine this is pre-populated
#pragma omp for shared(sum, array)
for(i=0;i<100;++i)
{
#pragma omp atomic
sum+=array[i];
}
```
This is perfectly correct, in that it functions as expected (note the use of synchronisation directives). However, it is not an optimal technique, because atomic operations are slow compared to the operations they replace. What reduction does is to create a private copy of a shared variable for each thread, into which the values found for all iterations performed by that thread are combined (by summation, for example). At the end of the parallelised loop, the private copies of this reduction variable are then 'reduced' back into the original shared variable. This might sound a little complex, so I'll try to use examples to clarify it

First, let's see the syntax for using reduction variables:
```Reduction (op: variable-list)
```
Now a simple example, reimplementing summation:
```int sum=0,i;
float array; //imagine this is pre-populated
#pragma omp for shared(array) reduction(+:sum)
for(i=0;i<100;++i)
{
sum+=array[i];
}
```
The main thing that's been accomplished is that we no longer need synchronisation. So what's going on here?

We've declared sum as a reduction variable for addition. There are 3 parts to how this works:
1. For each thread, a private copy of sum is created and initialised, based on op as follows:

 op Initial value + 0 * 1 - 0 & -0 | 0 ^ 0 && 1 || 0

2. Each thread will perform some subset of the iterations done by the loop, and each thread's copy of sum will contain a summation of array[i] for that subset. For instance if we had two threads and one processed i=1,3,5,7,…,99 and the other processed i=0,2,4,6,…,98 then the two copies of sum would end up with one containing a summation of array,array,array,… and the other a summation of array,array,array,…
3. Finally the thread-private versions of sum are reduced back into the shared sum variable. This is done by using the operator specified in the reduction clause - we've used '+' in this case, which means the values will be added.

Some things to note about reduction variables:
• Reduction variables cannot also be present in the private/shared modifier lists
• Many reduction variables can be used with different operators but each variable may be listed only once - e.g. #pragma omp for reduction(+: x,y) reduction(||: z) is quite allowable.
• Reduction variables must be of scalar (simple) type.
• The reduction operator must be valid for the type of the reduction variable.
• Pointer and reference types are not allowed.
• Within the body of the parallel loop, you have a lot of freedom with your reduction variables. You don't have to use the same operator as op - in fact, you can even use function calls instead of operators.
Reduction is in theory quite simple. However it's quite an unusual concept in the scope of C++, which makes it hard to explain. In most cases, though, you probably want to use a reduction variable to avoid synchronisation, in which case the reduction operation will just be the operation performed in the loop - if you're summing values use '+', if you're bit-wise and-ing values use '&', and so on.

#pragma omp parallel for
Often, all you want to use OpenMP for is to parallelise a loop. Quite frankly it's a bit messy to declare a parallel section using #pragma omp parallel, and then within that to declare a work-sharing loop using #pragma omp for. Luckily, OpenMP provides a nice neat way to accomplish this using a combination of the two constructs - #pragma omp parallel for. There's really nothing new to this construct - let's modify Example 5 to use it:

Example 5b:
```void example5b()
{

int val;
//we'll use our 4 threads to calculate & store the cubes of 1 - 9
for(int i=1;i<9;++i)
{
val[i-1] = i*i*i;
#pragma omp critical
{
}
}
}
```
All that this construct does is squash the parallel region declaration and the work-sharing loop declaration together. It saves you a few lines and makes it clearer what's going on. Particularly while you're getting to grips with using OpenMP, I would strongly recommend using this form.

#pragma omp sections & #pragma omp section
When I first saw how easy OpenMP made loop-parallelisation, I was immensely impressed. I was able to use it effectively in a couple of marathon matches in which the core of my algorithm came down to performing a few loops over some pretty big data sets. However, anyone who's done some standard multi-threaded programming will probably already have realised that #pragma omp for is pretty restrictive in the styles of multi-threaded program that can be written using it.

There are two major ways to parallelise an application - the first is simply to distribute loop iterations over multiple threads, which we've just seen. But often an application won't be performing the same operation on some huge data set, but may instead have a few distinct tasks to do. One example is a video game, in which the central loop might look a little like this:
```void playGame
{
while(!GameOver())
{
UpdateGamePhysics();
UpdateGameAI();
Draw();
}
}
```
In traditional multi-threaded programming, a programmer might look to create two additional threads so that physics and AI calculations can be done in parallel - the following is a very simple illustration of such an approach...
```void playGameMT
{
while(!GameOver)
{
Draw();
}
}

void UpdateGamePhysicsLoop()
{
while(true)
{
UpdateGamePhysics();
}
}
void UpdateGameAILoop()
{
while(true)
{
UpdateGameAI();
}
}
```
In this example, CreateGameThread is a function that creates a thread; this thread simply runs the function passed to it. What this code demonstrates is a rough idea of how multiple threads can be used to distribute work over more than one core, in a very different way to what we've seen so far. In theory you could conceive of some way of coding this in such a fashion that loop parallelism might be possible, but luckily you don't need to - OpenMP provides a mechanism for doing exactly this! In fact the same multi-threaded approach implemented using OpenMP might look a little like this:
```void playGame
{
while(!GameOver())
{
#pragma omp sections
{
#pragma omp section
{
UpdateGamePhysics();
}
#pragma omp section
{
UpdateGameAI();
}
#pragma omp section
{
Draw();
}
}//end of sections construct
}
}
```
The way that OpenMP allows us to perform several sections of code in parallel uses two constructs. Firstly we have the section construct. This informs OpenMP that the following code block is a 'chunk' of code that should all be executed by a single thread.

Second is the sections construct. This construct should contain a set of section constructs, and informs OpenMP that they should be executed in parallel. As usual, it's probably easier to see an example...

Example 6:
```void example6()
{
{
#pragma omp sections
{
#pragma omp section
{
for(int i=0;i<3;++i)
{
val[i]=1<<i;
}
}
#pragma omp section
{
for(int i=3;i<6;++i)
{
val[i]=1<<i;
}
}
#pragma omp section
{
for(int i=6;i<9;++i)
{
val[i]=1<<i;
}
}
}
}
//output the results AFTER the parallel region
for(int i=0;i<9;++i)
}
```
Results from Example 6:
2^0 = 1, calculated by thread 0
2^1 = 2, calculated by thread 0
2^2 = 4, calculated by thread 0
2^3 = 8, calculated by thread 1
2^4 = 16, calculated by thread 1
2^5 = 32, calculated by thread 1
2^6 = 64, calculated by thread 2
2^7 = 128, calculated by thread 2
2^8 = 256, calculated by thread 2

What we're doing in this example is very similar to Example 5, except that we're calculating powers of 2 instead of cubes. However, instead of parallelising a single loop we're performing three distinct pieces of code - each section will be executed by a single thread. Although in our simple example we actually do something that a parallel loop would be well suited for, the point is that each section can contain completely different functionality - one section could calculate the cubes of the first 100 integers while a second computes Fibonacci numbers and a third reads some data from a file.

There are some important things to note about using the sections work-sharing construct:
• The maximum number of threads it can use is normally not greater than the number of section constructs - each section is a distinct piece of work so cannot be shared across multiple threads. In our example we allow OpenMP to create up to 8 threads, but only one for each section will actually be used - 3 in total. This can place limits on the scalability of your application for systems with many cores - a parallelised for loop can utilise as many cores as the loop has iterations, but our sections example is always limited to using 3 cores whether the system has 3 or 33.

NOTE: Some OpenMP implementations allow nested parallel constructs. For instance in each section, you could parallelise the loop. In this situation you may utilise more threads than there are sections because the code in each section is now parallelised. However this is advanced material beyond the scope of this article.
• You may have more sections in a work-sharing construct than the maximum number of allowed threads, in this case OpenMP will simply start of with the maximum number of threads, taking a section each and then allocating further sections to threads as they become available.
• The default behaviour is that there is a barrier present at the end of the sections construct - if you have two consecutive sections constructs in the same parallel region then the second will not begin until the first has completed - even if there are unused threads with nothing to do during the first sections construct. For instance if, in our example, we had a second sections construct also containing 3 sections, then 3 threads would execute the first sections construct and would only execute the second sections construct after all 3 had finished.
NOTE: It is possible to change the default behaviour of the for and sections constructs with regard to barriers, using the NOWAIT modifier clause. This falls beyond the scope of this article however.

#pragma omp parallel sections
As #pragma omp parallel for is to #pragma omp for, so #pragma omp parallel sections is to #pragma omp sections. It simply combines a parallel region declaration with a single sections construct, saving you a little typing and making your code easier to read. Here's the code from Example 6, altered slightly to use this construct:
```void example7()
{
#pragma omp parallel sections default(none) shared(val,thread)
{
#pragma omp section
{
for(int i=0;i<3;++i)
{
val[i]=1<<i;
}
}
#pragma omp section
{
for(int i=3;i<6;++i)
{
val[i]=1<<i;
}
}
#pragma omp section
{
for(int i=6;i<9;++i)
{
val[i]=1<<i;
}
}
}
//output the results AFTER the parallel region
for(int i=0;i<9;++i) 