N-Body Simulation - Nicola De Cristofaro

Problem Description

In physics, the n-body problem consists of predicting the individual motions of a group of celestial objects that interact with each other gravitationally. Solving this problem has been motivated by the desire to understand the motions of the Sun, the Moon, the planets and the visible stars.


Solution Description

To address this problem, it is possible to simulate the behavior of particles, each having an initial mass, position and velocity. The simulation computes the position and velocity of each particle at the end of a given time interval.

The proposed solution has quadratic complexity with respect to the size of the input (the number of particles). A more efficient type of simulation, called Barnes-Hut, runs in O(n log n) time through approximations, but it is more complex to implement; it was therefore preferred to use the simpler quadratic solution and concentrate on the concepts of parallel programming.

Software used

  • Message Passing Interface (MPI) : A library used to allow several different processors on a cluster to communicate with each other. In other words, the goal of the Message Passing Interface is to provide a widely used standard for writing message-passing programs.

  • C language

  • Amazon Web Services (AWS): used to test the parallel program on a cluster of virtual machines created in the AWS Cloud environment.

Solution Structure

Initialization

For the initialization phase, the particles are created and written to a file, using a deterministic algorithm to randomize their values. In this way, at the first iteration of the computation, all the processors start by reading from this file (particles.txt).

To create and initialize the particles, execute the following commands:

  1.  gcc -o initialization particles_production.c
  2.  ./initialization [number of particles] (example -> ./initialization 1000)
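
For reference, the generator might look like the following minimal sketch; the seed, the value ranges and the use of a binary fwrite matching the simulation's fread are assumptions, and the actual particles_production.c may differ in its details.

    /* Hypothetical sketch of a deterministic particle generator */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        float mass;
        float x, y, z;
        float vx, vy, vz;
    } Particle;

    int main(int argc, char *argv[]) {
        if (argc != 2) {
            printf("Usage: %s [number of particles]\n", argv[0]);
            return EXIT_FAILURE;
        }
        int num_particles = atoi(argv[1]);
        Particle *particles = malloc(sizeof(Particle) * num_particles);

        srand(1234);  /* fixed seed: every run produces the same particles */
        for (int i = 0; i < num_particles; i++) {
            particles[i].mass = 1.0f;
            particles[i].x  = 2.0f * (rand() / (float)RAND_MAX) - 1.0f;
            particles[i].y  = 2.0f * (rand() / (float)RAND_MAX) - 1.0f;
            particles[i].z  = 2.0f * (rand() / (float)RAND_MAX) - 1.0f;
            particles[i].vx = 2.0f * (rand() / (float)RAND_MAX) - 1.0f;
            particles[i].vy = 2.0f * (rand() / (float)RAND_MAX) - 1.0f;
            particles[i].vz = 2.0f * (rand() / (float)RAND_MAX) - 1.0f;
        }

        /* binary layout matching the fread() used later by the simulation */
        FILE *fileWrite = fopen("particles.txt", "w");
        if (fileWrite == NULL) {
            printf("\nImpossible to open the file.\n");
            exit(EXIT_FAILURE);
        }
        fwrite(particles, sizeof(Particle) * num_particles, 1, fileWrite);
        fclose(fileWrite);
        free(particles);
        return EXIT_SUCCESS;
    }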

MPI Initialization, variables and memory allocation

The first part of the program deals with the definition of the variables used for the computation, the variables used to measure execution times and, finally, the variables needed by MPI. In addition to the usual operations, such as MPI_Init, MPI_Comm_size to obtain the number of running processors and MPI_Comm_rank to obtain the rank of the current processor, a derived datatype was created so that values of the Particle type can be exchanged between processors.

    MPI_Type_contiguous(7, MPI_FLOAT, &particle_type);
    MPI_Type_commit(&particle_type);

where the struct Particle is the following:

    typedef struct {
        float mass;
        float x, y, z;
        float vx, vy, vz;
    } Particle;
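
Taken together, the MPI setup described above might look like the following sketch; the variable names numtasks and myrank are assumptions, chosen to match the fragments used later.

    int numtasks, myrank;
    MPI_Datatype particle_type;

    MPI_Init(&argc, &argv);                       /* start the MPI environment */
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);     /* number of processors running */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);       /* rank of the current processor */

    /* derived datatype: a Particle is 7 contiguous floats */
    MPI_Type_contiguous(7, MPI_FLOAT, &particle_type);
    MPI_Type_commit(&particle_type);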

Equal distribution of workload between processors

A function is then called to distribute the work equally among the processors. After this function runs, each index of the dim_portions array corresponds to the rank of a processor and its content is the size of the portion of the array on which that processor will compute. In addition, the displacements array is filled in, where each index holds the start offset of a processor's portion.

    void compute_equal_workload_for_each_task(int *dim_portions, int *displs, int arraysize, int numtasks){

        for(int i=0; i<numtasks;i++){
            dim_portions[i] = (arraysize / numtasks) +
                            ((i < (arraysize % numtasks)) ? 1 : 0);
        }

        int offset = 0;
        for(int i=0;i<numtasks;i++){
            displs[i] = offset;
            offset += dim_portions[i];
        }
    }
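
As an illustration, the function might be invoked as follows; this is a hypothetical usage sketch whose buffer names match the fragments used later.

    /* one entry per processor: portion sizes and starting offsets */
    int *dim_portions = malloc(sizeof(int) * numtasks);
    int *displ = malloc(sizeof(int) * numtasks);
    compute_equal_workload_for_each_task(dim_portions, displ, num_particles, numtasks);
    /* e.g. num_particles = 10, numtasks = 3  ->  dim_portions = {4, 3, 3}, displ = {0, 4, 7} */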

Computing and communication between processors

  • The simulation runs for a fixed number of iterations I, set to 10. The processor with rank 0 is considered the MASTER processor, while the others are considered SLAVES.

  • The operations carried out in each iteration are the following (a condensed sketch of the whole loop is shown after the list):

  1. In the first iteration only, all processors read the initial state of the particles from the file. In the following iterations, the MASTER processor already holds the particle array updated by the previous computation, so it broadcasts it to all the other processors for the new computation.
        if(iteration == 0){
            //First iteration: all the processors can read the initial state of the particles from file
            FILE *fileRead = fopen("particles.txt", "r");
            if (fileRead == NULL){
                /* Impossible to open the file */
                printf("\nImpossible to open the file.\n");
                exit(EXIT_FAILURE);
            }

            fread(particles, sizeof(Particle) * num_particles, 1, fileRead);
            fclose(fileRead);
        }else{
            //MASTER processor has the output particle array of the previous computation so it broadcasts
            MPI_Bcast(particles, num_particles, particle_type, MASTER, MPI_COMM_WORLD);
        }
  2. Each processor performs the computation on its portion of the array, whose size was computed fairly in the previous step, by calling the bodyForce function, which calculates the new position and velocity of each particle. It should be emphasized that the MASTER processor does not only manage the communication between processors: it also performs the computation on its own portion.
    bodyForce(particles, displ[myrank], dt, dim_portions[myrank], num_particles );
  • As we can see from the function code, each processor performs the computation on a subset of particles, whose new values are computed based on the force exerted by all the other particles.
    /* Function that performs the computation on a specific portion of the workload */
    void bodyForce(Particle *all_particles, int startOffsetPortion, float dt, int dim_portion, int num_particles) {
        for (int i = 0; i < dim_portion; i++) {
            float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f;

            for (int j = 0; j < num_particles; j++) {
                float dx = all_particles[j].x - all_particles[startOffsetPortion + i].x;
                float dy = all_particles[j].y - all_particles[startOffsetPortion + i].y;
                float dz = all_particles[j].z - all_particles[startOffsetPortion + i].z;
                float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
                float invDist = 1.0f / sqrtf(distSqr);
                float invDist3 = invDist * invDist * invDist;

                Fx += dx * invDist3; Fy += dy * invDist3; Fz += dz * invDist3;
            }

            all_particles[startOffsetPortion + i].vx += dt * Fx;
            all_particles[startOffsetPortion + i].vy += dt * Fy;
            all_particles[startOffsetPortion + i].vz += dt * Fz;
        }

        //Integration of the positions of my portion
        for(int i = 0; i < dim_portion; i++) {
            all_particles[startOffsetPortion + i].x += all_particles[startOffsetPortion + i].vx * dt;
            all_particles[startOffsetPortion + i].y += all_particles[startOffsetPortion + i].vy * dt;
            all_particles[startOffsetPortion + i].z += all_particles[startOffsetPortion + i].vz * dt;
        }
    }
  3. Each processor sends its computed portion to the MASTER so that, after this gathering operation, the MASTER processor has the complete particle array computed for the current iteration.
    MPI_Gatherv(particles + displ[myrank], dim_portions[myrank], particle_type, gathered_particles, dim_portions, displ, particle_type,MASTER, MPI_COMM_WORLD);
  4. The input of the next iteration must be the particle array computed in the current iteration; since the MASTER processor holds the computed particles in gathered_particles, a swap is performed.
    if(myrank == MASTER) particles = gathered_particles;
  5. Finally, the execution time of each iteration is measured, and the MASTER processor writes to stdout the progress of the computation and the time taken for that iteration.
    MPI_Barrier(MPI_COMM_WORLD);
    iterStart = MPI_Wtime();
    ...
    ...
    MPI_Barrier(MPI_COMM_WORLD);  
    iterEnd = MPI_Wtime();
    if(myrank == MASTER) printf("Iteration %d of %d completed in %f seconds\n", iteration+1, I, (iterEnd-iterStart));
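
Putting the five steps together, the main loop has roughly the following shape. This is a condensed sketch based on the fragments above; error handling, allocations and some details are omitted.

    for (int iteration = 0; iteration < I; iteration++) {
        MPI_Barrier(MPI_COMM_WORLD);
        iterStart = MPI_Wtime();

        if (iteration == 0) {
            /* step 1a: every processor reads the initial state from particles.txt */
            FILE *fileRead = fopen("particles.txt", "r");
            fread(particles, sizeof(Particle) * num_particles, 1, fileRead);
            fclose(fileRead);
        } else {
            /* step 1b: the MASTER broadcasts the particles computed in the previous iteration */
            MPI_Bcast(particles, num_particles, particle_type, MASTER, MPI_COMM_WORLD);
        }

        /* step 2: each processor updates its own portion */
        bodyForce(particles, displ[myrank], dt, dim_portions[myrank], num_particles);

        /* step 3: the MASTER gathers all the computed portions */
        MPI_Gatherv(particles + displ[myrank], dim_portions[myrank], particle_type,
                    gathered_particles, dim_portions, displ, particle_type,
                    MASTER, MPI_COMM_WORLD);

        /* step 4: the gathered array becomes the input of the next iteration */
        if (myrank == MASTER) particles = gathered_particles;

        /* step 5: timing and progress output */
        MPI_Barrier(MPI_COMM_WORLD);
        iterEnd = MPI_Wtime();
        if (myrank == MASTER)
            printf("Iteration %d of %d completed in %f seconds\n",
                   iteration + 1, I, iterEnd - iterStart);
    }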

Finalization and deallocation

In this last phase the computation has been completed for all the iterations, so all the previously allocated memory is deallocated and MPI is finalized with MPI_Finalize(). The MASTER processor writes to stdout the average execution time of an iteration and the total execution time, and also writes the final state of the particles to a file for a possible correctness test performed later.
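
A minimal sketch of what this phase might look like; the variable names totalTime and fileWrite are assumptions, while the output file name parallel_output.txt is the one used in the correctness test below.

    if (myrank == MASTER) {
        /* average and total execution time on stdout */
        printf("Average iteration time: %f seconds\n", totalTime / I);
        printf("Total execution time: %f seconds\n", totalTime);

        /* final state written to file for the later correctness test */
        FILE *fileWrite = fopen("parallel_output.txt", "w");
        fwrite(particles, sizeof(Particle) * num_particles, 1, fileWrite);
        fclose(fileWrite);
    }

    /* deallocate buffers and the derived datatype, then shut MPI down */
    free(particles);
    MPI_Type_free(&particle_type);
    MPI_Finalize();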

Compilation and execution

Compilation

    mpicc -o parallel parallel_nBody.c -lm

Execution

    mpirun -np [number of processors] parallel [number of particles]
    example -> mpirun -np 4 parallel 1000

MAKE SURE THE FILE FROM WHICH THE PARTICLES ARE READ (particles.txt) HAS ALREADY BEEN CREATED

Correctness

  • For a parallel program to be correct, the execution with P processors and the execution with a single processor must produce the same output given the same input. To verify this condition, the output of the parallel multi-processor execution was written to a file (parallel_output.txt).

  • We now run the sequential version of the program, which can be obtained either by running the parallel version on 1 processor or by using the dedicated sequential version, which also avoids the initialization and finalization of MPI.

  1.  gcc -o sequential sequential_nBody.c -lm
  2.  ./sequential [number of particles] 
     example -> ./sequential 1000

    OR

  3. mpicc -o parallel parallel_nBody.c -lm
  4.  mpirun -np 1 parallel [number of particles] 
     example -> mpirun -np 1 parallel 1000
  • To perform the correctness test we use the sequential version of the program, whose output is written to the file (sequential_output.txt). A program (output_correctness.c, sketched after this list) was then created that compares the contents of the sequential output file and of the parallel output file to verify correctness.

  • After running the program in both the parallel and sequential versions as indicated above, it is possible to perform the correctness test as follows:

  1.  gcc -o correctness output_correctness.c
  2.  ./correctness [number of particles] (example -> ./correctness 1000)
  • It should be emphasized that, for the comparisons between the attributes of the particles, a special function was used to compare float values. This is because floating-point math is not exact: simple values such as 0.2 cannot be represented exactly as binary floating-point numbers, and this limited precision means that slight variations in the order of operations can change the result. Since these values are stored with different accuracies, the results may differ; consequently, if you perform various calculations and then compare the result with an expected value, it is highly unlikely that you will get exactly the desired value. For this reason, the function expresses the idea that two values close enough to each other can be considered equal.

  • In practice, two float values are considered equal if their difference falls within a certain epsilon.

    // compare two floats: return 1 if they are equal, 0 otherwise
    int compare_float(float f1, float f2){

        float precision = 0.01f;
        if (((f1 - precision) < f2) &&
            ((f1 + precision) > f2))
            return 1;
        else
            return 0;
    }
  • Graphic display of the correctness of the parallel solution on small input (5)
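
Building on compare_float, the comparison performed by output_correctness.c might look like the following minimal sketch; the buffer and file handle names are assumptions, and the real program may differ in its details.

    /* Hypothetical sketch: compare sequential and parallel outputs particle by particle,
       assuming the Particle struct and compare_float defined above. */
    int main(int argc, char *argv[]) {
        int num_particles = atoi(argv[1]);
        Particle *seq = malloc(sizeof(Particle) * num_particles);
        Particle *par = malloc(sizeof(Particle) * num_particles);

        FILE *fileSeq = fopen("sequential_output.txt", "r");
        FILE *filePar = fopen("parallel_output.txt", "r");
        fread(seq, sizeof(Particle) * num_particles, 1, fileSeq);
        fread(par, sizeof(Particle) * num_particles, 1, filePar);
        fclose(fileSeq); fclose(filePar);

        int equal = 1;
        for (int i = 0; i < num_particles; i++) {
            if (!compare_float(seq[i].x,  par[i].x)  || !compare_float(seq[i].y,  par[i].y)  ||
                !compare_float(seq[i].z,  par[i].z)  || !compare_float(seq[i].vx, par[i].vx) ||
                !compare_float(seq[i].vy, par[i].vy) || !compare_float(seq[i].vz, par[i].vz)) {
                equal = 0;
                break;
            }
        }
        printf("%s\n", equal ? "Correct: the two outputs are equal." : "Wrong: the two outputs differ.");
        free(seq); free(par);
        return 0;
    }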

Problem evaluation and Benchmarks

To evaluate the performance of the proposed solution, slightly revised versions of the programs were created (sequential: sequential_nBody_benchmarking.c, parallel: parallel_nBody_benchmarking.c); to obtain better benchmarks, writing the final output to file was removed so as to focus on computation time.

We can now describe the results obtained by measuring the scalability of the application. There are two basic ways to measure the parallel performance of an application: strong and weak scaling.

Strong Scaling

In this type of measurement the size of the problem (the number of particles) remains fixed but the number of processors increases.

For the strong scaling test, the number of processors ranges from 1 to 32, since 32 is the maximum number of cores that can be used with an AWS Educate account.

| P | N | Avg Iteration Time (seconds) | Total Computation Time (seconds) | Strong Scaling Efficiency (%) |
|---:|---:|---:|---:|---:|
| 1 | 100 000 | 220.005443 | 2200.054430 (≅ 37 min) | 100 |
| 2 | 100 000 | 109.134224 | 1091.342241 | 100 |
| 3 | 100 000 | 73.359361 | 733.593608 | 99 |
| 4 | 100 000 | 54.819589 | 548.195891 | 100 |
| 5 | 100 000 | 43.785572 | 437.855721 | 100 |
| 6 | 100 000 | 36.821113 | 368.211132 | 99 |
| 7 | 100 000 | 31.567646 | 315.676457 | 99 |
| 8 | 100 000 | 27.412049 | 274.120492 | 100 |
| 9 | 100 000 | 24.375618 | 243.756177 | 100 |
| 10 | 100 000 | 21.972933 | 219.729333 | 100 |
| 11 | 100 000 | 20.238017 | 202.380169 | 98 |
| 12 | 100 000 | 18.540094 | 185.400640 | 98 |
| 13 | 100 000 | 17.049654 | 170.496536 | 99 |
| 14 | 100 000 | 15.787466 | 157.874664 | 99 |
| 15 | 100 000 | 14.750758 | 147.507577 | 99 |
| 16 | 100 000 | 13.837183 | 138.371826 | 99 |
| 17 | 100 000 | 12.937212 | 129.372124 | 100 |
| 18 | 100 000 | 12.210964 | 122.109644 | 100 |
| 19 | 100 000 | 11.575181 | 115.751814 | 100 |
| 20 | 100 000 | 10.975263 | 109.752627 | 100 |
| 21 | 100 000 | 10.547308 | 105.473079 | 99 |
| 22 | 100 000 | 9.991014 | 99.910142 | 100 |
| 23 | 100 000 | 9.599492 | 95.994917 | 99 |
| 24 | 100 000 | 9.162451 | 91.624507 | 100 |
| 25 | 100 000 | 8.797949 | 87.979485 | 100 |
| 26 | 100 000 | 8.514978 | 85.149779 | 99 |
| 27 | 100 000 | 8.152149 | 81.521492 | 99 |
| 28 | 100 000 | 7.863964 | 78.639643 | 99 |
| 29 | 100 000 | 7.639826 | 76.398257 | 99 |
| 30 | 100 000 | 7.336128 | 73.361283 | 99 |
| 31 | 100 000 | 7.095153 | 70.951534 | 100 |
| 32 | 100 000 | 6.880448 | 68.804478 | 99 |

From the measurements carried out to verify strong scalability, we noticed that, for the same input size, adding processors improved the execution time; obviously the improvement was not constant for each processor added, because the more processors participated in the computation, the greater the overhead produced by their communication. In our tests, up to 32 processors, the execution time always decreased, but the more the number of processors increased, the smaller the performance improvement became. Going further and adding more processors, we would reach a point where the running time no longer decreases but begins to increase, as the overhead produced becomes greater than the performance improvement.

Weak Scaling

In this case the size of the problem increases as the number of processors increases, ensuring that the workload is always equally distributed among the processors.

| P | N | Avg Iteration Time (seconds) | Total Computation Time (seconds) | Weak Scaling Efficiency (%) |
|---:|---:|---:|---:|---:|
| 1 | 10 000 | 1.961227 | 19.612270 | 100 |
| 2 | 20 000 | 3.921892 | 39.218918 | 50 |
| 3 | 30 000 | 5.893506 | 58.935057 | 33 |
| 4 | 40 000 | 7.872926 | 78.729259 | 24 |
| 5 | 50 000 | 10.999243 | 109.992425 | 17 |
| 6 | 60 000 | 13.164654 | 131.645642 | 14 |
| 7 | 70 000 | 15.374346 | 153.743464 | 12 |
| 8 | 80 000 | 17.566746 | 175.667462 | 11 |
| 9 | 90 000 | 19.766140 | 197.661402 | 9 |
| 10 | 100 000 | 21.990926 | 219.909259 | 8 |
| 11 | 110 000 | 24.507350 | 245.073497 | 8 |
| 12 | 120 000 | 26.600296 | 266.022962 | 7 |
| 13 | 130 000 | 28.731199 | 287.311988 | 6 |
| 14 | 140 000 | 30.937833 | 309.378334 | 6 |
| 15 | 150 000 | 33.151154 | 331.511537 | 5 |
| 16 | 160 000 | 35.357112 | 353.571116 | 5 |

The ideal weak scaling behavior would be a flat line: since the input size increases in proportion to the number of processors and the workload is always equally distributed, the computation time should stay the same. Unfortunately this does not happen because, as already said, a greater amount of overhead is produced for each added processor, mainly due to the communication between the processors. However, our experiment gave good results because, as the number of processors and the input size grew (10,000 more particles for each processor added to the computation), the execution time increased in a small and roughly constant way at each step, remaining relatively close to the ideal.

Speedup and efficiency

Speedup is another performance metric: it represents the performance improvement obtained by running a program in parallel compared to running it sequentially.

Usually the best we can hope for is to divide the work among the processors without adding any extra work. In that situation, running the program with P processors makes it execute P times faster than the sequential program; when this happens, we say that the parallel program has linear speedup.

In practice, this kind of speedup is difficult to obtain, since using more processors in most cases introduces an overhead, i.e. extra time mainly spent in communication operations between the processors.

Efficiency, on the other hand, is defined as the ratio between speedup and number of processors and measures the fraction of time during which a processor is usefully used.

Below are the speedup and efficiency values calculated as follows:

Speedup

$$ S(P,N) = \frac{T(1,N)}{T(P,N)} $$

Efficiency

$$ E = \frac{S}{P} = \frac{\frac{T(1,N)}{T(P,N)}}{P} = \frac{T(1,N)}{P \cdot T(P,N)} $$

where:

  • P = number of processors used
  • N = input size (number of particles)
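
For example, taking the strong scaling measurements above for P = 32 and N = 100 000, with T(1,N) ≈ 2200.05 s and T(32,N) ≈ 68.80 s:

$$ S(32, N) = \frac{2200.05}{68.80} \approx 32.0, \qquad E = \frac{32.0}{32} \approx 1.00 $$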

| P | N | Speedup | Efficiency |
|---:|---:|---:|---:|
| 1 | 100 000 | 1.0 | 1.00 |
| 2 | 100 000 | 2.0 | 1.00 |
| 3 | 100 000 | 3.0 | 1.00 |
| 4 | 100 000 | 4.0 | 1.00 |
| 5 | 100 000 | 5.0 | 1.00 |
| 6 | 100 000 | 5.9 | 1.00 |
| 7 | 100 000 | 6.9 | 1.00 |
| 8 | 100 000 | 8.0 | 1.00 |
| 9 | 100 000 | 9.0 | 1.00 |
| 10 | 100 000 | 10.0 | 1.00 |
| 11 | 100 000 | 10.9 | 0.99 |
| 12 | 100 000 | 11.9 | 0.99 |
| 13 | 100 000 | 12.9 | 0.99 |
| 14 | 100 000 | 13.9 | 0.99 |
| 15 | 100 000 | 14.9 | 1.00 |
| 16 | 100 000 | 15.9 | 0.99 |
| 17 | 100 000 | 17.0 | 0.99 |
| 18 | 100 000 | 18.0 | 1.00 |
| 19 | 100 000 | 19.0 | 1.00 |
| 20 | 100 000 | 20.0 | 1.00 |
| 21 | 100 000 | 20.9 | 1.00 |
| 22 | 100 000 | 22.0 | 0.99 |
| 23 | 100 000 | 22.9 | 1.00 |
| 24 | 100 000 | 24.0 | 1.00 |
| 25 | 100 000 | 25.0 | 1.00 |
| 26 | 100 000 | 25.8 | 0.99 |
| 27 | 100 000 | 27.0 | 1.00 |
| 28 | 100 000 | 28.0 | 1.00 |
| 29 | 100 000 | 28.8 | 0.99 |
| 30 | 100 000 | 30.0 | 1.00 |
| 31 | 100 000 | 31.0 | 1.00 |
| 32 | 100 000 | 32.0 | 1.00 |

We noticed that, perhaps surprisingly, both the speedup and the efficiency were always very close to the ideal. This is because they depend on how much of the work can be executed in parallel, and in our case the results were excellent: apart from a few extra instructions executed by the MASTER processor, all the program instructions were executed in parallel.

Furthermore, the communication between the processors is carried out entirely with the collective communication functions of the MPI library, and this is another factor that improves performance, since every node that has already received the information participates in forwarding it to the other nodes.

Conclusions

From the considerations made on strong scalability, weak scalability, speedup and efficiency, we can conclude that we achieved good results and observed evident improvements when adding processors. Since the problem has quadratic complexity, the sequential execution time was very high (for an input size of 100 000 particles, the total execution time was about 37 minutes). By employing 32 processors in parallel, the total computation time decreased to about 1 minute, demonstrating the power of parallel computation.

