
Profiling for red-mot #9

Closed
ElliotB256 opened this issue Feb 9, 2021 · 27 comments
Labels: enhancement (New feature or request)

@ElliotB256
Starting to profile the latest changes to red-mot to get a head start on optimisations.

Interestingly, it seems the vast majority of program time is being taken up inside rayon - specifically SwitchToThread and SleepConditionVariableSRW. There is also a lot of time lost to spinning.
[screenshot]
This seems likely related to a known issue in rayon, where sleeping threads would burn a lot of cycles/power. For example, see
rayon-rs/rayon#642
rayon-rs/rayon#795
rayon-rs/rfcs#5
It looks like some of these issues were fixed in rayon 1.4.0, although we are currently on rayon 1.3.0. I'll try updating rayon (new version is 1.5.0) and see if it improves things.

@ElliotB256

Upgrading removed the spin time from SwitchToThread, which is nice.

Hotspots are now as below:
[screenshot]
It's encouraging that 'pow' and 'log' (two elementary mathematical functions) are already among the main bottlenecks in the program. We are still not properly load balanced, which explains some of the context-switching/sleep-condition cost; see e.g. the utilisation over 8 threads:
[screenshot]

The pow:
75.6% of the pow time is spent calculating the -3rd power of the laser frequency:
[screenshot]
It is straightforward to precalculate this once and store it in the struct, as sketched below.
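A minimal sketch of the idea, with hypothetical struct and field names (not the actual AtomECS types):

```
/// Hypothetical cooling-light component that caches frequency^-3 so the hot
/// loop never calls powf/powi at all.
pub struct CoolingLightCached {
    pub frequency: f64,
    /// Precalculated frequency^-3, filled in once at construction.
    pub frequency_inv_cubed: f64,
}

impl CoolingLightCached {
    pub fn new(frequency: f64) -> Self {
        CoolingLightCached {
            frequency,
            frequency_inv_cubed: frequency.powi(-3),
        }
    }
}

fn main() {
    // Illustrative value only; the per-atom loop now just multiplies by the cached field.
    let light = CoolingLightCached::new(4.6e14);
    println!("cached frequency^-3 = {:e}", light.frequency_inv_cubed);
}
```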

@ElliotB256

The log:
The log all comes from the Poisson dist in photons_scattered.rs. It accounts for only 2.8% of program time.

laser::rate::...::run

  • Another 2.4% is taken up by normalising the laser beam direction.
  • Time is also spent calculating c^2 in the prefactor.
    [screenshot]

laser::force::...::run:

  • More time is taken up normalising the direction of the Gaussian beam.
    [screenshot]

@ElliotB256 ElliotB256 transferred this issue from another repository Feb 9, 2021
@ElliotB256

I'll add inner parallelism to laser::rate and laser::force - it should help reduce the cost of sleeping threads and also cut the runtime of these systems. A rough sketch of the pattern is below.
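By 'inner parallelism' I mean parallelising over entities inside the system itself, rather than relying only on the dispatcher's system-level parallelism. A minimal sketch of the pattern using plain rayon, with placeholder types and per-atom work (not the real laser::rate code):

```
use rayon::prelude::*;

// Placeholder per-atom sampler, standing in for the real rate/force samplers.
#[derive(Clone, Default)]
struct RateSampler {
    rate: f64,
}

fn main() {
    let mut samplers = vec![RateSampler::default(); 10_000];
    let detunings: Vec<f64> = (0..10_000).map(|i| i as f64).collect();

    // Inner-parallel version: the per-atom work inside the system is spread
    // over the rayon worker pool instead of running on a single thread.
    samplers
        .par_iter_mut()
        .zip(detunings.par_iter())
        .for_each(|(sampler, detuning)| {
            sampler.rate = 1.0 / (1.0 + detuning * detuning);
        });

    println!("first rate = {}", samplers[0].rate);
}
```

In the real systems the equivalent change is switching a .join() to specs' .par_join(), as in the rate-coefficient excerpt quoted later in this thread.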

@ElliotB256

Adding inner parallelism to laser::rate reduced wall time from 63s to 51s and improved load balancing across processors (still not ideal, but better - compare to the above).
[screenshot]

@ElliotB256

Implemented inner parallelism for laser::force::CalculateAbsorptionForcesSystem, reducing the time from 51s to 46s.
[screenshot]

@ElliotB256

Introduced precalculation of rate_prefactor to avoid the f64 powf(-3.0) call, along with some other operations. 46s -> 42s.

@ElliotB256

Tried removing ECS barriers; it doesn't make much difference (~42s in both cases).

@ElliotB256

The benchmark was being limited by writing to file, so I removed the file output: 42s -> 37.7s.

@ElliotB256

It's possible that with 10k atoms there just isn't enough data to run across 6 cores with reasonable load. If I turn up the atom number in the benchmark to 100k atoms it is very clear that parallel systems now achieve proper load balancing:
[screenshot]

To remedy this, I made a number of vector initialisations (e.g. the samplers) inner parallel, along with the actual photon-number calculations (a sketch of the initialisation pattern is below). For 100k atoms the time went from 43s -> 30.5s with improved load balancing (roughly 4 steps are shown).
[screenshot]
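A minimal sketch of what the inner-parallel initialisation looks like with plain rayon, using a stand-in sampler type (not the real components):

```
use rayon::prelude::*;

#[derive(Clone, Default)]
struct LaserSamplers {
    contents: Vec<f64>,
}

fn main() {
    let n_beams = 6;
    let mut samplers = vec![LaserSamplers::default(); 100_000];

    // Per-frame sampler initialisation spread over the rayon pool instead of
    // running on a single thread.
    samplers.par_iter_mut().for_each(|s| {
        s.contents.clear();
        s.contents.resize(n_beams, 0.0);
    });

    println!("initialised {} samplers", samplers.len());
}
```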

@ElliotB256

Counterintuitively, this gives significantly better performance at lower atom numbers. 🤔

For 10k atoms at 5k steps, the benchmark time goes from 37.7s -> 22s, with much better load balancing:
[screenshot]

@ElliotB256

I'm profiling on an i7-8700, which is a 6-core, 12-thread CPU.

Running with a 12-thread pool:

[screenshot]
17.8s of CPU time is lost to spinning. Breakdown of CPU utilisation:
[screenshot]
Total time = 20.4s

Running with a 10-thread pool:

[screenshot]
11s of CPU time lost to spinning (SwitchToThread). CPU utilisation:
[screenshot]
Total time = 20.6s

Running with a 6-thread pool:

[screenshot]
5.9s of CPU time lost to spinning (SwitchToThread).
[screenshot]
Total time = 22.6s

Running with a 4-thread pool:

[screenshot]
Total time = 27.3s

Running with a 2-thread pool:

[screenshot]
Total time = 40.7s (similar to what we had with poor load balancing, before optimisation).

This is pretty much as expected - hyperthreading does not really benefit the simulation; what matters is the number of physical cores on the machine. Furthermore, one thread seems to sleep (one in the pool is used up as the main thread, which doesn't do much). Hence, running with more than 6 threads doesn't give much advantage on my 6-core PC.
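For reference, the pool size in these runs is just the number of threads given to the rayon pool that the specs dispatcher uses. A minimal sketch of sweeping the pool size, mirroring the ThreadPoolBuilder call in the benchmark code later in this thread, with placeholder work standing in for a simulation step:

```
use rayon::prelude::*;
use std::time::Instant;

fn main() {
    // Placeholder workload standing in for one simulation step.
    let step = |data: &mut [f64]| {
        data.par_iter_mut().for_each(|x| *x = (*x * 1.000_001).sqrt());
    };

    for threads in [2usize, 4, 6, 10, 12] {
        // Build a dedicated rayon pool of the requested size; in the benchmark
        // this pool is handed to the specs DispatcherBuilder via add_pool.
        let pool = rayon::ThreadPoolBuilder::new()
            .num_threads(threads)
            .build()
            .unwrap();

        let mut data = vec![1.0_f64; 1_000_000];
        let start = Instant::now();
        pool.install(|| {
            for _ in 0..100 {
                step(&mut data);
            }
        });
        println!("{} threads: {:?}", threads, start.elapsed());
    }
}
```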

@ElliotB256

87b1d38

@ElliotB256

More profiling fun.

I noticed there are some heap allocations which are taking up a lot of time. They come from the various vec!/Vec<T> allocations in the program, which store their data on the heap (see e.g. https://doc.rust-lang.org/std/vec/struct.Vec.html#guarantees). These are used mostly for the samplers, because the number of laser beams is not known at compile time.

It does seem somewhat counter-productive to the general ECS data layout to use heap-allocated Vec components. I am tempted to try a branch where we cap the number of laser beams (e.g. 16, or some user-defined constant) and instead use fixed-size sampler arrays. It uses more memory, but having that memory off the heap might just be faster. It's fairly straightforward to test; a sketch of the change is below.
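A minimal sketch of the change, with hypothetical names (the real sampler components live in the laser modules):

```
/// Cap on the number of cooling beams; just a constant to tweak.
pub const COOLING_BEAM_LIMIT: usize = 16;

#[derive(Clone, Copy, Default)]
pub struct LaserSampler {
    pub intensity: f64,
    pub detuning: f64,
}

/// Heap-allocated version: one allocation per entity, sampler data scattered on the heap.
pub struct LaserSamplersVec {
    pub contents: Vec<LaserSampler>,
}

/// Fixed-size version: the samplers live inline in the component itself, so
/// iterating components walks contiguous memory with no allocator calls.
pub struct LaserSamplersArray {
    pub contents: [LaserSampler; COOLING_BEAM_LIMIT],
}

fn main() {
    let fixed = LaserSamplersArray {
        contents: [LaserSampler::default(); COOLING_BEAM_LIMIT],
    };
    let heap = LaserSamplersVec {
        contents: vec![LaserSampler::default(); 3],
    };
    println!("{} fixed slots vs {} heap samplers", fixed.contents.len(), heap.contents.len());
}
```

The trade-off is a little wasted space for entities that use fewer than COOLING_BEAM_LIMIT beams, in exchange for contiguous storage and no per-frame allocator traffic.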

@ElliotB256

Still more profiling, and it's a big one!

Following the above, I removed all heap-allocated vector components (i.e. those created through vec!) and instead implemented fixed-size arrays (nominally supporting up to 16 lasers - it's just a constant you can tweak). Details in 5bfddc1.

The performance increase is staggering! Runtime is now 10.6s, down from 21.5s - another factor of two in speed. The effective CPU utilisation is extremely good.
[screenshot]
[screenshot]

One caveat - I broke something during the changes, so the simulation is not actually giving the correct physics right now. Nonetheless, the same number of systems is running, so I believe the performance numbers are still representative. It's probably worth fixing the red-mot-perf branch and merging it back onto red-mot.

@ElliotB256

Looking at this again, some of the speedup was because the Poisson dist was no longer being calculated, due to a NaN error. In 1c5fd I implemented LaserSamplerMasks to remove the NaN, which occurred when summing rates for undefined lasers.

There is a performance gain, but not quite as substantial as before - 17.7s compared to 22.5s using vec!. We are losing a lot of time to spinning, so I'll try to tweak that; possibly there is too much par_for. A sketch of the mask idea is below.
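A minimal sketch of what the mask does here, with hypothetical field names (the real LaserSamplerMasks may differ): only slots flagged as filled contribute to the sum, so garbage or NaN values in unused slots can never reach the result.

```
const COOLING_BEAM_LIMIT: usize = 16;

#[derive(Clone, Copy, Default)]
struct LaserSampler {
    rate: f64,
}

/// Marks which of the fixed-size sampler slots actually correspond to a real beam.
struct LaserSamplerMasks {
    filled: [bool; COOLING_BEAM_LIMIT],
}

/// Sum rates over the masked-in slots only; dirty entries in unused slots are ignored.
fn total_rate(samplers: &[LaserSampler; COOLING_BEAM_LIMIT], masks: &LaserSamplerMasks) -> f64 {
    samplers
        .iter()
        .zip(masks.filled.iter())
        .map(|(s, &filled)| if filled { s.rate } else { 0.0 })
        .sum()
}

fn main() {
    let mut samplers = [LaserSampler::default(); COOLING_BEAM_LIMIT];
    samplers[0].rate = 1.0e5;
    samplers[1].rate = 2.0e5;
    samplers[2].rate = f64::NAN; // an unused, never-initialised slot

    let mut filled = [false; COOLING_BEAM_LIMIT];
    filled[0] = true;
    filled[1] = true;

    let masks = LaserSamplerMasks { filled };
    println!("summed rate = {}", total_rate(&samplers, &masks));
}
```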

@ElliotB256

In 8558d17 I improved system-level parallelism, which shaved the time from 17.7s to 16s. There is still about 25s of total CPU time spent on spin, which probably costs around 2s of wall time.

@ElliotB256

If I remove the Poisson dist and instead just return lambda, the time goes from 16s -> 12.2s (a sketch of this swap is below). I still have a lot of spin time though; at the microscopic level, this is clearly due to a few single-threaded systems blocking while they complete.
[screenshot]
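For context, the test just replaces the Poisson sample with its mean. A rough sketch of the swap, assuming the rand/rand_distr crates; the expected-photon helper is a placeholder, not the real photons_scattered.rs code:

```
use rand::prelude::*;
use rand_distr::{Distribution, Poisson};

/// Placeholder for the expected photon number computed from the scattering rate.
fn expected_photons(rate: f64, dt: f64) -> f64 {
    rate * dt
}

fn main() {
    let mut rng = rand::thread_rng();
    let lambda = expected_photons(2.0e6, 1.0e-6);

    // Full model: sample the actual photon number from a Poisson distribution
    // (this is where the expensive log/pow calls come from).
    let dist = Poisson::new(lambda).unwrap();
    let sampled: f64 = dist.sample(&mut rng);

    // Stripped-down test: just use the mean, no sampling at all.
    let mean_only = lambda;

    println!("sampled = {}, mean only = {}", sampled, mean_only);
}
```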

@ElliotB256

Other things tested:

  • Using lto = true in Cargo.toml under the release profile to enable fat LTO. It doesn't give any speedup on the benchmark, although compile time goes from about 4s to roughly 1 minute.
  • Using the native target flag for the Rust compilation, e.g. RUSTFLAGS=-Ctarget-cpu=native. Simulation time goes from 18s to 19s.
  • Compiling with llvm-args=-ffast-math to better hint loop vectorisation to the compiler. No difference.

@ElliotB256

Microarchitecture profiling

Microarchitecture profiling with the laser sampler COOLING_BEAM_LIMIT (which determines the sampler array size, and hence the cache footprint) = 16 gives:
[screenshot]
Total runtime = 17.9s.

If we reduce the laser sampler COOLING_BEAM_LIMIT to 8 we instead find:
[screenshot]
Total runtime = 14.45s.

A few more notes for future optimisation:

  • Loop vectorisation will help for the core-bound instructions. It should be possible to vectorise pow and log (used by the Poisson dist and currently a large fraction of simulation time).
  • A number of sampler arrays are initialised to zero every frame. Now that we have the masks, we can instead leave unused entries 'dirty' and not reset them to zero (they won't be used in the calculation anyway if the mask is clear).

@ElliotB256

Removing sampler initialisation, relying on the mask instead and leaving unused data uninitialised: time goes from 18.3s -> 14.5s with a COOLING_BEAM_LIMIT of 16. With a COOLING_BEAM_LIMIT of 8 it goes to 14.1s.

@ElliotB256

Implemented in 7d0f52d

@ElliotB256

Note that without ApplyEmissionForceOption and EnableScatteringFluctuations, the benchmark comes in at 7.9s.

@ElliotB256

ElliotB256 commented Feb 11, 2021

Calculating the rate coefficients is currently one of the longest operations in the code, and this particular implementation is back-end bound (in both memory and core):

```
// Excerpt from the rate-coefficient system; `gaussian`, `cooling` and `index`
// refer to the cooling-light entity from the enclosing loop, i.e. this whole
// par_join is dispatched once per laser beam.
(
    &laser_detunings,
    &laser_intensities,
    &atomic_transition,
    &magnetic_field_sampler,
    &mut rate_coefficients,
)
    .par_join()
    .for_each(|(detunings, intensities, atominfo, bfield, rates)| {
        let beam_direction_vector = gaussian.direction.normalize();
        let costheta = if &bfield.field.norm_squared() < &(10.0 * f64::EPSILON) {
            0.0
        } else {
            beam_direction_vector
                .normalize()
                .dot(&bfield.field.normalize())
        };

        let prefactor =
            atominfo.rate_prefactor * intensities.contents[index.index].intensity;

        let scatter1 =
            0.25 * (cooling.polarization as f64 * costheta + 1.).powf(2.) * prefactor
                / (detunings.contents[index.index].detuning_sigma_plus.powf(2.)
                    + (PI * atominfo.linewidth).powf(2.));

        let scatter2 =
            0.25 * (cooling.polarization as f64 * costheta - 1.).powf(2.) * prefactor
                / (detunings.contents[index.index]
                    .detuning_sigma_minus
                    .powf(2.)
                    + (PI * atominfo.linewidth).powf(2.));

        let scatter3 = 0.5 * (1. - costheta.powf(2.)) * prefactor
            / (detunings.contents[index.index].detuning_pi.powf(2.)
                + (PI * atominfo.linewidth).powf(2.));
        rates.contents[index.index].rate = scatter1 + scatter2 + scatter3;
    });
```

This code is problematic for several reasons. First, it doesn't get vectorised, so the float operations are scalar. Second, there is the 'random' access through the sampler arrays using the cooling-light index. It's also inefficient that we dispatch a par_for once for each laser beam.

An easy way to improve this would be to pack the required CoolingLight and GaussianBeam structs into a stack-allocated array, then dispatch a single par_for over atoms to calculate the rates. That par_for would contain an iteration over each element of the array, which could probably be SIMD-compiled. At the very least, it increases the work per thread for each atom in the par_for, which should be a good thing. A rough sketch is below.
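A minimal sketch of that restructuring using plain rayon, with simplified stand-ins for the beam and sampler types (not the real AtomECS components):

```
use rayon::prelude::*;

const COOLING_BEAM_LIMIT: usize = 16;

/// Simplified stand-in for the per-beam data (the CoolingLight + GaussianBeam
/// fields that the rate calculation actually needs).
#[derive(Clone, Copy, Default)]
struct BeamInfo {
    intensity: f64,
    detuning: f64,
    linewidth: f64,
}

#[derive(Clone, Copy, Default)]
struct RateSampler {
    rates: [f64; COOLING_BEAM_LIMIT],
}

fn main() {
    // Pack all beam data into one small stack-allocated array up front...
    let mut beams = [BeamInfo::default(); COOLING_BEAM_LIMIT];
    let n_beams = 6;
    for (i, b) in beams.iter_mut().take(n_beams).enumerate() {
        *b = BeamInfo { intensity: 1.0 + i as f64, detuning: -1.0e6, linewidth: 3.0e7 };
    }

    let mut samplers = vec![RateSampler::default(); 100_000];

    // ...then dispatch a single par_for over atoms; the inner loop over beams walks
    // a short, contiguous array, which the compiler can hopefully vectorise.
    samplers.par_iter_mut().for_each(|sampler| {
        for (i, beam) in beams.iter().take(n_beams).enumerate() {
            sampler.rates[i] =
                beam.intensity / (beam.detuning * beam.detuning + beam.linewidth * beam.linewidth);
        }
    });

    println!("rate[0][0] = {:e}", samplers[0].rates[0]);
}
```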

@ElliotB256

This was attempted in 4098053, but I wasn't able to get llvm to use vectorized operations. Reverted for now.

@ElliotB256

ElliotB256 commented Feb 11, 2021

Tried it again in e425a5e but no meaningful improvement.

@ElliotB256

Made the remaining laser systems parallel in c69c0b4. The code is now very evenly distributed over all processors and is DRAM-bandwidth bound ~50% of the time.
[screenshot]

@ElliotB256

ElliotB256 commented Feb 13, 2021

For general interest, I ran a performance comparison to benchmark specs against the simplest possible Matlab code. It's just a loop that adds and multiplies float arrays of 1 million elements.

```
tic
for i=1:n
    a = a + b .* a;
end
duration = toc()*1e6;
fprintf('wall time/step=%fus\n', duration/n)
fprintf('wall time/step/atom=%fus\n', duration/n/1e6)
```

This gives:

```
wall time/step/atom=0.000431us
```


For comparison, the specs program looks like so:

```
extern crate nalgebra;
use nalgebra::Vector3;
use rand::distributions::{Distribution, Normal};
use specs::{
    Builder, Component, DispatcherBuilder, ReadStorage, System, VecStorage, World, WriteStorage,
};
use std::time::Instant;

pub struct ComponentA {
    x: f64,
    y: f64,
    z: f64,
}
impl Component for ComponentA {
    type Storage = VecStorage<Self>;
}

pub struct ComponentB {
    pub value: f64,
}
impl Component for ComponentB {
    type Storage = VecStorage<Self>;
}

pub struct ComponentC {
    pub value: f64,
}
impl Component for ComponentC {
    type Storage = VecStorage<Self>;
}

pub struct SimpleSystem;
impl<'a> System<'a> for SimpleSystem {
    type SystemData = (
        ReadStorage<'a, ComponentA>,
        ReadStorage<'a, ComponentB>,
        WriteStorage<'a, ComponentC>,
    );

    fn run(&mut self, (a, b, mut c): Self::SystemData) {
        use rayon::prelude::*;
        use specs::ParJoin;

        (&a, &b, &mut c).par_join().for_each(|(a, b, mut c)| {
            c.value = a.x * b.value + a.y;
        });
    }
}

fn main() {
    let mut world = World::new();
    let mut builder = DispatcherBuilder::new();

    builder.add(SimpleSystem, "simple_system", &[]);

    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(6)
        .build()
        .unwrap();

    builder.add_pool(::std::sync::Arc::new(pool));
    let mut dispatcher = builder.build();
    dispatcher.setup(&mut world.res);

    // Add test entities
    let mut rng = rand::thread_rng();
    let dist = Normal::new(0.0, 1.0);

    for _ in 0..1000000 {
        world
            .create_entity()
            .with(ComponentA {
                x: 1.0,
                y: 0.0,
                z: dist.sample(&mut rng),
            })
            .with(ComponentB {
                value: dist.sample(&mut rng),
            })
            .with(ComponentC { value: 0.0 })
            .build();
    }

    let loop_start = Instant::now();

    for _ in 0..10000 {
        dispatcher.dispatch(&mut world.res);
        world.maintain();
    }

    println!(
        "Simulation loop completed in {} ms.",
        loop_start.elapsed().as_millis()
    );
}
```

The iteration over 1,000,000 entities for 10k steps takes 18s. That gives a wall time per step of 1.8ms, and a wall time/step/atom of 1.8ns, which is a factor of ~4 slower than straight multiplication in Matlab. The Matlab multiplication is compiled using vectorised float routines, which I still can't get working in Rust, so that probably accounts for the difference. All in all, I think the safety and flexibility of the Rust code is worth it. This benchmark is also a case that favours Matlab - multiplication of arrays is kind of Matlab's whole point. I expect the Rust would reach the same performance if I managed to get vector routines working, but I think enough is enough for now.

@ElliotB256 ElliotB256 added the enhancement New feature or request label Feb 14, 2021