
Profiling for red-mot #9

Closed
ElliotB256 opened this issue Feb 9, 2021 · 27 comments
Labels: enhancement (New feature or request)

@ElliotB256
Starting to profile the latest changes to red-mot to get a head start on optimisations.

Interestingly, it seems the vast majority of program time is being taken up inside rayon - specifically SwitchToThread and SleepConditionVariableSRW. There is also a lot of time lost to spinning.
[screenshot]
This seems likely related to a known issue in rayon, where sleeping threads would burn a lot of cycles/power. For example, see
rayon-rs/rayon#642
rayon-rs/rayon#795
rayon-rs/rfcs#5
It looks like some of these issues were fixed in rayon 1.4.0, although we are currently on rayon 1.3.0. I'll try updating rayon (new version is 1.5.0) and see if it improves things.

@ElliotB256

Upgrading removed the spin time from SwitchToThread, which is nice.

Hotspots are now as below:
[screenshot]
It's encouraging that 'pow' and 'log' (two elementary mathematical functions) are already among the main bottlenecks in the program. We are still not properly load balanced, which explains some of the context-switching/sleep-condition cost; see e.g. the utilisation over 8 threads:
[screenshot]

The pow:
75.6% of the pow time is spent calculating the -3rd power of the laser frequency:
[screenshot]
It is straightforward to precalculate this once and store it in the struct, as sketched below.
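A minimal sketch of the idea, with hypothetical struct and field names (not the actual AtomECS types):

```
/// Hypothetical cooling-light component that caches frequency^-3 so the hot
/// loop never calls powf/powi at all.
pub struct CoolingLightCached {
    pub frequency: f64,
    /// Precalculated frequency^-3, filled in once at construction.
    pub frequency_inv_cubed: f64,
}

impl CoolingLightCached {
    pub fn new(frequency: f64) -> Self {
        CoolingLightCached {
            frequency,
            frequency_inv_cubed: frequency.powi(-3),
        }
    }
}

fn main() {
    // Illustrative value only; the per-atom loop now just multiplies by the cached field.
    let light = CoolingLightCached::new(4.6e14);
    println!("cached frequency^-3 = {:e}", light.frequency_inv_cubed);
}
```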

@ElliotB256

The log:
The log all comes from the Poisson dist in photons_scattered.rs. It accounts for only 2.8% of program time.

laser::rate::...::run

  • Another 2.4% is taken up by normalising the laser beam direction.
  • Time is also spent calculating c^2 in the prefactor.
    [screenshot]

laser::force::...::run:

  • More time is taken up normalising the direction of the Gaussian beam.
    [screenshot]

@ElliotB256 ElliotB256 transferred this issue from another repository Feb 9, 2021
@ElliotB256

I'll add inner parallelism to laser::rate and laser::force - it should help reduce the cost of sleeping threads and also cut the runtime of these systems. A rough sketch of the pattern is below.
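By 'inner parallelism' I mean parallelising over entities inside the system itself, rather than relying only on the dispatcher's system-level parallelism. A minimal sketch of the pattern using plain rayon, with placeholder types and per-atom work (not the real laser::rate code):

```
use rayon::prelude::*;

// Placeholder per-atom sampler, standing in for the real rate/force samplers.
#[derive(Clone, Default)]
struct RateSampler {
    rate: f64,
}

fn main() {
    let mut samplers = vec![RateSampler::default(); 10_000];
    let detunings: Vec<f64> = (0..10_000).map(|i| i as f64).collect();

    // Inner-parallel version: the per-atom work inside the system is spread
    // over the rayon worker pool instead of running on a single thread.
    samplers
        .par_iter_mut()
        .zip(detunings.par_iter())
        .for_each(|(sampler, detuning)| {
            sampler.rate = 1.0 / (1.0 + detuning * detuning);
        });

    println!("first rate = {}", samplers[0].rate);
}
```

In the real systems the equivalent change is switching a .join() to specs' .par_join(), as in the rate-coefficient excerpt quoted later in this thread.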

@ElliotB256

Adding inner parallelism to laser::rate reduced wall time from 63s to 51s and improved load balancing across processors (still not ideal, but better - compare to the above).
[screenshot]

@ElliotB256

Implemented inner parallelism for laser::force::CalculateAbsorptionForcesSystem, reducing the time from 51s to 46s.
[screenshot]

@ElliotB256

Introduced precalculation of rate_prefactor to avoid the f64 powf(-3.0) call, along with some other operations. 46s -> 42s.

@ElliotB256

Tried removing ECS barriers; it doesn't make much difference (~42s in both cases).

@ElliotB256

The benchmark was being limited by writing to file, so I removed the file output: 42s -> 37.7s.

@ElliotB256

It's possible that with 10k atoms there just isn't enough data to run across 6 cores with reasonable load. If I turn up the atom number in the benchmark to 100k atoms it is very clear that parallel systems now achieve proper load balancing:
[screenshot]

To remedy this, I made a number of vector initialisations (e.g. the samplers) inner parallel, along with the actual photon-number calculations (a sketch of the initialisation pattern is below). For 100k atoms the time went from 43s -> 30.5s with improved load balancing (roughly 4 steps are shown).
[screenshot]
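A minimal sketch of what the inner-parallel initialisation looks like with plain rayon, using a stand-in sampler type (not the real components):

```
use rayon::prelude::*;

#[derive(Clone, Default)]
struct LaserSamplers {
    contents: Vec<f64>,
}

fn main() {
    let n_beams = 6;
    let mut samplers = vec![LaserSamplers::default(); 100_000];

    // Per-frame sampler initialisation spread over the rayon pool instead of
    // running on a single thread.
    samplers.par_iter_mut().for_each(|s| {
        s.contents.clear();
        s.contents.resize(n_beams, 0.0);
    });

    println!("initialised {} samplers", samplers.len());
}
```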

@ElliotB256

Counterintuitively, this gives significantly better performance at lower atom numbers. 🤔

For 10k atoms at 5k steps, the benchmark time goes from 37.7s -> 22s, with much better load balancing:
[screenshot]

@ElliotB256

I'm profiling on an i7-8700, which is a 6-core, 12-thread CPU.

Running with a 12-thread pool:

[screenshot]
17.8s of CPU time is lost to spinning. Breakdown of CPU utilisation:
[screenshot]
Total time = 20.4s

Running with a 10-thread pool:

[screenshot]
11s of CPU time lost to spinning (SwitchToThread). CPU utilisation:
[screenshot]
Total time = 20.6s

Running with a 6-thread pool:

[screenshot]
5.9s of CPU time lost to spinning (SwitchToThread).
[screenshot]
Total time = 22.6s

Running with a 4-thread pool:

[screenshot]
Total time = 27.3s

Running with a 2-thread pool:

[screenshot]
Total time = 40.7s (similar to what we had with poor load balancing, before optimisation).

This is pretty much as expected - hyperthreading does not really benefit the simulation; what matters is the number of physical cores on the machine. Furthermore, one thread seems to sleep (one in the pool is used up as the main thread, which doesn't do much). Hence, running with more than 6 threads doesn't give much advantage on my 6-core PC.
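For reference, the pool size in these runs is just the number of threads given to the rayon pool that the specs dispatcher uses. A minimal sketch of sweeping the pool size, mirroring the ThreadPoolBuilder call in the benchmark code later in this thread, with placeholder work standing in for a simulation step:

```
use rayon::prelude::*;
use std::time::Instant;

fn main() {
    // Placeholder workload standing in for one simulation step.
    let step = |data: &mut [f64]| {
        data.par_iter_mut().for_each(|x| *x = (*x * 1.000_001).sqrt());
    };

    for threads in [2usize, 4, 6, 10, 12] {
        // Build a dedicated rayon pool of the requested size; in the benchmark
        // this pool is handed to the specs DispatcherBuilder via add_pool.
        let pool = rayon::ThreadPoolBuilder::new()
            .num_threads(threads)
            .build()
            .unwrap();

        let mut data = vec![1.0_f64; 1_000_000];
        let start = Instant::now();
        pool.install(|| {
            for _ in 0..100 {
                step(&mut data);
            }
        });
        println!("{} threads: {:?}", threads, start.elapsed());
    }
}
```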

@ElliotB256

87b1d38

@ElliotB256

More profiling fun.

I noticed there are some heap allocations which are taking up a lot of time. They come from the various vec!/Vec<T> allocations in the program, which store their data on the heap (see e.g. https://doc.rust-lang.org/std/vec/struct.Vec.html#guarantees). These are used mostly for the samplers, because the number of laser beams is not known at compile time.

It does seem somewhat counter-productive to the general ECS data layout to use heap-allocated Vec components. I am tempted to try a branch where we cap the number of laser beams (e.g. 16, or some user-defined constant) and instead use fixed-size sampler arrays. It uses more memory, but having that memory off the heap might just be faster. It's fairly straightforward to test; a sketch of the change is below.
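A minimal sketch of the change, with hypothetical names (the real sampler components live in the laser modules):

```
/// Cap on the number of cooling beams; just a constant to tweak.
pub const COOLING_BEAM_LIMIT: usize = 16;

#[derive(Clone, Copy, Default)]
pub struct LaserSampler {
    pub intensity: f64,
    pub detuning: f64,
}

/// Heap-allocated version: one allocation per entity, sampler data scattered on the heap.
pub struct LaserSamplersVec {
    pub contents: Vec<LaserSampler>,
}

/// Fixed-size version: the samplers live inline in the component itself, so
/// iterating components walks contiguous memory with no allocator calls.
pub struct LaserSamplersArray {
    pub contents: [LaserSampler; COOLING_BEAM_LIMIT],
}

fn main() {
    let fixed = LaserSamplersArray {
        contents: [LaserSampler::default(); COOLING_BEAM_LIMIT],
    };
    let heap = LaserSamplersVec {
        contents: vec![LaserSampler::default(); 3],
    };
    println!("{} fixed slots vs {} heap samplers", fixed.contents.len(), heap.contents.len());
}
```

The trade-off is a little wasted space for entities that use fewer than COOLING_BEAM_LIMIT beams, in exchange for contiguous storage and no per-frame allocator traffic.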

@ElliotB256

Still more profiling, and it's a big one!

Following the above, I removed all heap-allocated vector components (i.e. those created through vec!) and instead implemented fixed-size arrays (nominally supporting up to 16 lasers - it's just a constant you can tweak). Details in 5bfddc1.

The performance increase is staggering! Runtime is now 10.6s, down from 21.5s - another factor of two in speed. The effective CPU utilisation is extremely good.
[screenshot]
[screenshot]

One caveat - I broke something during the changes, so the simulation is not actually giving the correct physics right now. Nonetheless, the same number of systems is running, so I believe the performance numbers are still representative. It's probably worth fixing the red-mot-perf branch and merging it back onto red-mot.

@ElliotB256

Looking at this again, some of the speedup was because the Poisson dist was no longer being calculated, due to a NaN error. In 1c5fd I implemented LaserSamplerMasks to remove the NaN, which occurred when summing rates for undefined lasers.

There is a performance gain, but not quite as substantial as before - 17.7s compared to 22.5s using vec!. We are losing a lot of time to spinning, so I'll try to tweak that; possibly there is too much par_for. A sketch of the mask idea is below.
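A minimal sketch of what the mask does here, with hypothetical field names (the real LaserSamplerMasks may differ): only slots flagged as filled contribute to the sum, so garbage or NaN values in unused slots can never reach the result.

```
const COOLING_BEAM_LIMIT: usize = 16;

#[derive(Clone, Copy, Default)]
struct LaserSampler {
    rate: f64,
}

/// Marks which of the fixed-size sampler slots actually correspond to a real beam.
struct LaserSamplerMasks {
    filled: [bool; COOLING_BEAM_LIMIT],
}

/// Sum rates over the masked-in slots only; dirty entries in unused slots are ignored.
fn total_rate(samplers: &[LaserSampler; COOLING_BEAM_LIMIT], masks: &LaserSamplerMasks) -> f64 {
    samplers
        .iter()
        .zip(masks.filled.iter())
        .map(|(s, &filled)| if filled { s.rate } else { 0.0 })
        .sum()
}

fn main() {
    let mut samplers = [LaserSampler::default(); COOLING_BEAM_LIMIT];
    samplers[0].rate = 1.0e5;
    samplers[1].rate = 2.0e5;
    samplers[2].rate = f64::NAN; // an unused, never-initialised slot

    let mut filled = [false; COOLING_BEAM_LIMIT];
    filled[0] = true;
    filled[1] = true;

    let masks = LaserSamplerMasks { filled };
    println!("summed rate = {}", total_rate(&samplers, &masks));
}
```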

@ElliotB256

In 8558d17 I improved system-level parallelism, which shaved the time from 17.7s to 16s. There is still about 25s of total CPU time spent on spin, which probably costs around 2s of wall time.

@ElliotB256

If I remove the Poisson dist and instead just return lambda, the time goes from 16s -> 12.2s (a sketch of this swap is below). I still have a lot of spin time though; at the microscopic level, this is clearly due to a few single-threaded systems blocking while they complete.
[screenshot]
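For context, the test just replaces the Poisson sample with its mean. A rough sketch of the swap, assuming the rand/rand_distr crates; the expected-photon helper is a placeholder, not the real photons_scattered.rs code:

```
use rand::prelude::*;
use rand_distr::{Distribution, Poisson};

/// Placeholder for the expected photon number computed from the scattering rate.
fn expected_photons(rate: f64, dt: f64) -> f64 {
    rate * dt
}

fn main() {
    let mut rng = rand::thread_rng();
    let lambda = expected_photons(2.0e6, 1.0e-6);

    // Full model: sample the actual photon number from a Poisson distribution
    // (this is where the expensive log/pow calls come from).
    let dist = Poisson::new(lambda).unwrap();
    let sampled: f64 = dist.sample(&mut rng);

    // Stripped-down test: just use the mean, no sampling at all.
    let mean_only = lambda;

    println!("sampled = {}, mean only = {}", sampled, mean_only);
}
```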

@ElliotB256

Other things tested:

  • Using lto = true in Cargo.toml under the release profile to enable fat LTO. It doesn't give any speedup on the benchmark, although compile time goes from about 4s to roughly 1 minute.
  • Using the native target flag for the Rust compilation, e.g. RUSTFLAGS=-Ctarget-cpu=native. Simulation time goes from 18s to 19s.
  • Compiling with llvm-args=-ffast-math to better hint loop vectorisation to the compiler. No difference.

@ElliotB256

Microarchitecture profiling

Microarchitecture profiling with the laser sampler COOLING_BEAM_LIMIT (which determines the sampler array size, and hence the cache footprint) = 16 gives:
[screenshot]
Total runtime = 17.9s.

If we reduce the laser sampler COOLING_BEAM_LIMIT to 8 we instead find:
[screenshot]
Total runtime = 14.45s.

A few more notes for future optimisation:

  • Loop vectorisation will help for the core-bound instructions. It should be possible to vectorise pow and log (used by the Poisson dist and currently a large fraction of simulation time).
  • A number of sampler arrays are initialised to zero every frame. Now that we have the masks, we can instead leave unused entries 'dirty' and not reset them to zero (they won't be used in the calculation anyway if the mask is clear).

@ElliotB256

Removing sampler initialisation, relying on the mask instead and leaving unused data uninitialised: time goes from 18.3s -> 14.5s with a COOLING_BEAM_LIMIT of 16. With a COOLING_BEAM_LIMIT of 8 it goes to 14.1s.

@ElliotB256

Implemented in 7d0f52d

@ElliotB256

Note that without ApplyEmissionForceOption and EnableScatteringFluctuations, the benchmark comes in at 7.9s.

@ElliotB256

ElliotB256 commented Feb 11, 2021

Calculating the rate coefficients is currently one of the longest operations in the code, and this particular implementation is back-end bound (in both memory and core):

```
// Excerpt from the rate-coefficient system; `gaussian`, `cooling` and `index`
// refer to the cooling-light entity from the enclosing loop, i.e. this whole
// par_join is dispatched once per laser beam.
(
    &laser_detunings,
    &laser_intensities,
    &atomic_transition,
    &magnetic_field_sampler,
    &mut rate_coefficients,
)
    .par_join()
    .for_each(|(detunings, intensities, atominfo, bfield, rates)| {
        let beam_direction_vector = gaussian.direction.normalize();
        let costheta = if &bfield.field.norm_squared() < &(10.0 * f64::EPSILON) {
            0.0
        } else {
            beam_direction_vector
                .normalize()
                .dot(&bfield.field.normalize())
        };

        let prefactor =
            atominfo.rate_prefactor * intensities.contents[index.index].intensity;

        let scatter1 =
            0.25 * (cooling.polarization as f64 * costheta + 1.).powf(2.) * prefactor
                / (detunings.contents[index.index].detuning_sigma_plus.powf(2.)
                    + (PI * atominfo.linewidth).powf(2.));

        let scatter2 =
            0.25 * (cooling.polarization as f64 * costheta - 1.).powf(2.) * prefactor
                / (detunings.contents[index.index]
                    .detuning_sigma_minus
                    .powf(2.)
                    + (PI * atominfo.linewidth).powf(2.));

        let scatter3 = 0.5 * (1. - costheta.powf(2.)) * prefactor
            / (detunings.contents[index.index].detuning_pi.powf(2.)
                + (PI * atominfo.linewidth).powf(2.));
        rates.contents[index.index].rate = scatter1 + scatter2 + scatter3;
    });
```

This code is problematic for several reasons. First, it doesn't get vectorised, so the float operations are scalar. Second, there is the 'random' access through the sampler arrays using the cooling-light index. It's also inefficient that we dispatch a par_for once for each laser beam.

An easy way to improve this would be to pack the required CoolingLight and GaussianBeam structs into a stack-allocated array, then dispatch a single par_for over atoms to calculate the rates. That par_for would contain an iteration over each element of the array, which could probably be SIMD-compiled. At the very least, it increases the work per thread for each atom in the par_for, which should be a good thing. A rough sketch is below.
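A minimal sketch of that restructuring using plain rayon, with simplified stand-ins for the beam and sampler types (not the real AtomECS components):

```
use rayon::prelude::*;

const COOLING_BEAM_LIMIT: usize = 16;

/// Simplified stand-in for the per-beam data (the CoolingLight + GaussianBeam
/// fields that the rate calculation actually needs).
#[derive(Clone, Copy, Default)]
struct BeamInfo {
    intensity: f64,
    detuning: f64,
    linewidth: f64,
}

#[derive(Clone, Copy, Default)]
struct RateSampler {
    rates: [f64; COOLING_BEAM_LIMIT],
}

fn main() {
    // Pack all beam data into one small stack-allocated array up front...
    let mut beams = [BeamInfo::default(); COOLING_BEAM_LIMIT];
    let n_beams = 6;
    for (i, b) in beams.iter_mut().take(n_beams).enumerate() {
        *b = BeamInfo { intensity: 1.0 + i as f64, detuning: -1.0e6, linewidth: 3.0e7 };
    }

    let mut samplers = vec![RateSampler::default(); 100_000];

    // ...then dispatch a single par_for over atoms; the inner loop over beams walks
    // a short, contiguous array, which the compiler can hopefully vectorise.
    samplers.par_iter_mut().for_each(|sampler| {
        for (i, beam) in beams.iter().take(n_beams).enumerate() {
            sampler.rates[i] =
                beam.intensity / (beam.detuning * beam.detuning + beam.linewidth * beam.linewidth);
        }
    });

    println!("rate[0][0] = {:e}", samplers[0].rates[0]);
}
```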

@ElliotB256

This was attempted in 4098053, but I wasn't able to get llvm to use vectorized operations. Reverted for now.

@ElliotB256

ElliotB256 commented Feb 11, 2021

Tried it again in e425a5e but no meaningful improvement.

@ElliotB256

Made the remaining laser systems parallel in c69c0b4. The code is now very evenly distributed over all processors and is DRAM-bandwidth bound ~50% of the time.
[screenshot]

@ElliotB256

ElliotB256 commented Feb 13, 2021

For general interest, I ran a performance comparison to benchmark specs against the simplest possible Matlab code. It's just a loop that adds and multiplies float arrays of 1 million elements.

```
tic
for i=1:n
    a = a + b .* a;
end
duration = toc()*1e6;
fprintf('wall time/step=%fus\n', duration/n)
fprintf('wall time/step/atom=%fus\n', duration/n/1e6)
```

This gives:

```
wall time/step/atom=0.000431us
```


For comparison, the specs program looks like so:

```
extern crate nalgebra;
use nalgebra::Vector3;
use rand::distributions::{Distribution, Normal};
use specs::{
    Builder, Component, DispatcherBuilder, ReadStorage, System, VecStorage, World, WriteStorage,
};
use std::time::Instant;

pub struct ComponentA {
    x: f64,
    y: f64,
    z: f64,
}
impl Component for ComponentA {
    type Storage = VecStorage<Self>;
}

pub struct ComponentB {
    pub value: f64,
}
impl Component for ComponentB {
    type Storage = VecStorage<Self>;
}

pub struct ComponentC {
    pub value: f64,
}
impl Component for ComponentC {
    type Storage = VecStorage<Self>;
}

pub struct SimpleSystem;
impl<'a> System<'a> for SimpleSystem {
    type SystemData = (
        ReadStorage<'a, ComponentA>,
        ReadStorage<'a, ComponentB>,
        WriteStorage<'a, ComponentC>,
    );

    fn run(&mut self, (a, b, mut c): Self::SystemData) {
        use rayon::prelude::*;
        use specs::ParJoin;

        (&a, &b, &mut c).par_join().for_each(|(a, b, mut c)| {
            c.value = a.x * b.value + a.y;
        });
    }
}

fn main() {
    let mut world = World::new();
    let mut builder = DispatcherBuilder::new();

    builder.add(SimpleSystem, "simple_system", &[]);

    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(6)
        .build()
        .unwrap();

    builder.add_pool(::std::sync::Arc::new(pool));
    let mut dispatcher = builder.build();
    dispatcher.setup(&mut world.res);

    // Add test entities
    let mut rng = rand::thread_rng();
    let dist = Normal::new(0.0, 1.0);

    for _ in 0..1000000 {
        world
            .create_entity()
            .with(ComponentA {
                x: 1.0,
                y: 0.0,
                z: dist.sample(&mut rng),
            })
            .with(ComponentB {
                value: dist.sample(&mut rng),
            })
            .with(ComponentC { value: 0.0 })
            .build();
    }

    let loop_start = Instant::now();

    for _ in 0..10000 {
        dispatcher.dispatch(&mut world.res);
        world.maintain();
    }

    println!(
        "Simulation loop completed in {} ms.",
        loop_start.elapsed().as_millis()
    );
}
```

The iteration over 1,000,000 entities for 10k steps takes 18s. That gives a wall time per step of 1.8ms, and a wall time/step/atom of 1.8ns, which is a factor of ~4 slower than straight multiplication in Matlab. The Matlab multiplication is compiled using vectorised float routines, which I still can't get working in Rust, so that probably accounts for the difference. All in all, I think the safety and flexibility of the Rust code is worth it. This benchmark is also a case that favours Matlab - multiplication of arrays is kind of Matlab's whole point. I expect the Rust would reach the same performance if I managed to get vector routines working, but I think enough is enough for now.

@ElliotB256 ElliotB256 added the enhancement New feature or request label Feb 14, 2021