Some optimizations (cont) #395

Open
wants to merge 19 commits into
base: master

QuarticCat
Contributor

@QuarticCat QuarticCat commented Sep 30, 2022

This time I tried some radical optimizations.

Benchmark Approach

To make the results more accurate, I updated my benchmarking approach. Here's the command:

cargo build --release &&
with-bench hyperfine --warmup=3 "$(echo ~/.cargo/target/release/difft sample_files/slow_{before,after}.rs)" &&
with-bench perf stat ~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null &&
time ~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null

where with-bench is a simple Zsh function that fixes CPU frequency and disables boost:

_bench-start() {
    sudo cpupower frequency-set -u 3.6G -d 3.6G >/dev/null
    sudo sh -c 'echo 0 > /sys/devices/system/cpu/cpufreq/boost'
    echo '>>>>> BENCH START' >&2
}
_bench-end() {
    sudo cpupower frequency-set -u 10G -d 0.1G >/dev/null
    sudo sh -c 'echo 1 > /sys/devices/system/cpu/cpufreq/boost'
    echo '>>>>> BENCH END' >&2
}
with-bench() {
    _bench-start
    trap '_bench-end' EXIT INT
    "$@"
}

Note that this time difft is invoked directly rather than through cargo run, so the speedup percentages will be higher (cargo run adds a fixed extra cost).

Benchmark Results

Before my first PR:

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):      1.120 s ±  0.032 s    [User: 1.080 s, System: 0.039 s]
  Range (min … max):    1.058 s …  1.169 s    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

          1,113.87 msec task-clock:u              #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,423      page-faults:u             #    1.278 K/sec                  
     3,844,769,039      cycles:u                  #    3.452 GHz                    
       319,285,519      stalled-cycles-frontend:u #    8.30% frontend cycles idle   
     1,172,227,367      stalled-cycles-backend:u  #   30.49% backend cycles idle    
     4,498,772,345      instructions:u            #    1.17  insn per cycle         
                                                  #    0.26  stalled cycles per insn
       887,538,102      branches:u                #  796.804 M/sec                  
        19,835,182      branch-misses:u           #    2.23% of all branches        

       1.116052704 seconds time elapsed

       1.062544000 seconds user
       0.049941000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   1.03s  user 0.05s system 99% cpu 1.077 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                549 KB
page faults from disk:     0
other page faults:         1490

Before my second PR:

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     829.9 ms ±  15.8 ms    [User: 798.4 ms, System: 30.5 ms]
  Range (min … max):   811.7 ms … 864.5 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            824.10 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,361      page-faults:u             #    1.652 K/sec                  
     2,838,969,066      cycles:u                  #    3.445 GHz                    
        49,902,292      stalled-cycles-frontend:u #    1.76% frontend cycles idle   
     1,171,930,117      stalled-cycles-backend:u  #   41.28% backend cycles idle    
     3,511,062,625      instructions:u            #    1.24  insn per cycle         
                                                  #    0.33  stalled cycles per insn
       663,773,868      branches:u                #  805.457 M/sec                  
        18,248,650      branch-misses:u           #    2.75% of all branches        

       0.824474162 seconds time elapsed

       0.787025000 seconds user
       0.036682000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.68s  user 0.04s system 99% cpu 0.726 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                435 KB
page faults from disk:     0
other page faults:         1430

Now:

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     636.9 ms ±   7.4 ms    [User: 596.3 ms, System: 38.8 ms]
  Range (min … max):   629.2 ms … 655.3 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            623.62 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,348      page-faults:u             #    2.162 K/sec                  
     2,122,551,842      cycles:u                  #    3.404 GHz                    
        38,713,265      stalled-cycles-frontend:u #    1.82% frontend cycles idle   
       776,217,583      stalled-cycles-backend:u  #   36.57% backend cycles idle    
     2,669,241,946      instructions:u            #    1.26  insn per cycle         
                                                  #    0.29  stalled cycles per insn
       510,784,798      branches:u                #  819.070 M/sec                  
        15,456,464      branch-misses:u           #    3.03% of all branches        

       0.623958630 seconds time elapsed

       0.596642000 seconds user
       0.026663000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.51s  user 0.05s system 99% cpu 0.553 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                409 KB
page faults from disk:     0
other page faults:         1417

Conclusion

Speed: 100% -> 135% -> 176% (according to hyperfine)

Memory: 100% -> 79% -> 74%
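
For reference, the percentages above can be reproduced from the logs. This is a minimal sketch that assumes speed is the baseline hyperfine mean divided by the current mean, and memory is the current "max memory" value divided by the baseline value. The function names are illustrative and not part of difftastic.

```rust
/// Speed relative to a baseline, in percent: a run that takes half the
/// time scores 200. (Assumed formula; it matches the figures in the thread.)
fn speed_percent(baseline_time: f64, current_time: f64) -> f64 {
    baseline_time / current_time * 100.0
}

/// Memory relative to a baseline, in percent: lower is better.
/// The unit cancels out, so only the ratio matters.
fn memory_percent(baseline_mem: f64, current_mem: f64) -> f64 {
    current_mem / baseline_mem * 100.0
}
```

With the hyperfine means in milliseconds (1120.0, 829.9, 636.9) and the max memory readings (549, 435, 409), rounding the results gives 135 and 176 for speed, and 79 and 74 for memory.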

Caveats

  • In commit Eliminate some vec clones, memory usage unexpectedly increased. I haven't figured out why.

  • In commit Change a RefCell in Vertex to UnsafeCell, a lot of unsafe code is introduced, and it clearly spreads beyond the boundary it should stay within. I don't know how to design abstractions for it.

  • In commit Refactor seen map, I don't understand why your original code was written this way, so I just faithfully converted it into a faster version.
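
To illustrate the trade-off behind the second caveat, here is a minimal sketch of the same field held in a RefCell and in an UnsafeCell. The types and the `neighbours` field are hypothetical stand-ins, not difftastic's actual Vertex. The point is that RefCell checks borrows at runtime, while UnsafeCell skips the check and pushes the obligation onto every caller, which is how unsafe code tends to spread.

```rust
use std::cell::{RefCell, UnsafeCell};

/// Hypothetical node with runtime-checked interior mutability.
struct CheckedNode {
    neighbours: RefCell<Option<Vec<u32>>>,
}

impl CheckedNode {
    fn new() -> Self {
        CheckedNode { neighbours: RefCell::new(None) }
    }

    /// Safe: `RefCell` panics if a conflicting borrow is still alive.
    fn set_neighbours(&self, ns: Vec<u32>) {
        *self.neighbours.borrow_mut() = Some(ns);
    }

    fn neighbour_count(&self) -> Option<usize> {
        self.neighbours.borrow().as_ref().map(Vec::len)
    }
}

/// Hypothetical node with unchecked interior mutability.
struct UncheckedNode {
    neighbours: UnsafeCell<Option<Vec<u32>>>,
}

impl UncheckedNode {
    fn new() -> Self {
        UncheckedNode { neighbours: UnsafeCell::new(None) }
    }

    /// # Safety
    /// No slice returned by `neighbours` may be alive during this call.
    unsafe fn set_neighbours(&self, ns: Vec<u32>) {
        unsafe { *self.neighbours.get() = Some(ns) };
    }

    /// # Safety
    /// `set_neighbours` must not be called while the returned slice is in use.
    unsafe fn neighbours(&self) -> Option<&[u32]> {
        let slot: &Option<Vec<u32>> = unsafe { &*self.neighbours.get() };
        slot.as_deref()
    }
}
```

Marking both accessors `unsafe` keeps the contract visible at each call site. It does not remove the problem described above; it only documents it.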

@QuarticCat
Contributor Author

Refactor parents' representation.

Speed: 187%

Memory: 73%

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     600.1 ms ±   3.8 ms    [User: 567.1 ms, System: 31.5 ms]
  Range (min … max):   594.9 ms … 606.2 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            582.59 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,344      page-faults:u             #    2.307 K/sec                  
     1,976,757,963      cycles:u                  #    3.393 GHz                    
        35,702,044      stalled-cycles-frontend:u #    1.81% frontend cycles idle   
       760,550,837      stalled-cycles-backend:u  #   38.47% backend cycles idle    
     2,545,504,360      instructions:u            #    1.29  insn per cycle         
                                                  #    0.30  stalled cycles per insn
       472,339,368      branches:u                #  810.760 M/sec                  
        13,293,176      branch-misses:u           #    2.81% of all branches        

       0.582992026 seconds time elapsed

       0.552362000 seconds user
       0.030102000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.50s  user 0.03s system 99% cpu 0.528 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                401 KB
page faults from disk:     0
other page faults:         1418

@QuarticCat
Contributor Author

Compress EnteredDelimiter.

Speed: 189%

Memory: 71%

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     591.4 ms ±  11.1 ms    [User: 559.8 ms, System: 30.9 ms]
  Range (min … max):   582.6 ms … 618.6 ms    10 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            576.92 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,339      page-faults:u             #    2.321 K/sec                  
     1,960,597,413      cycles:u                  #    3.398 GHz                    
        47,556,040      stalled-cycles-frontend:u #    2.43% frontend cycles idle   
       711,186,359      stalled-cycles-backend:u  #   36.27% backend cycles idle    
     2,530,607,392      instructions:u            #    1.29  insn per cycle         
                                                  #    0.28  stalled cycles per insn
       467,326,782      branches:u                #  810.038 M/sec                  
        13,318,875      branch-misses:u           #    2.85% of all branches        

       0.577286658 seconds time elapsed

       0.529902000 seconds user
       0.046657000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.51s  user 0.02s system 99% cpu 0.530 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                391 KB
page faults from disk:     0
other page faults:         1408

@QuarticCat
Contributor Author

Reserve vec capacity.

Speed: 202%

Memory: 71%
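
As a general illustration of why reserving capacity helps (the exact call sites changed in this commit are not shown in this thread), the sketch below counts how often a Vec's capacity changes while pushing. The helper name is made up for this example.

```rust
/// Pushes `n` items into `v` and counts how many times the capacity
/// changed, which is a proxy for the number of reallocations.
fn count_reallocations(mut v: Vec<u64>, n: usize) -> usize {
    let mut reallocations = 0;
    let mut capacity = v.capacity();
    for i in 0..n {
        v.push(i as u64);
        if v.capacity() != capacity {
            reallocations += 1;
            capacity = v.capacity();
        }
    }
    reallocations
}
```

Calling it with `Vec::with_capacity(1000)` and `n = 1000` reports zero capacity changes, while starting from `Vec::new()` reports several, each of which may copy the existing elements.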

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     555.5 ms ±   5.2 ms    [User: 521.3 ms, System: 33.5 ms]
  Range (min … max):   549.2 ms … 564.5 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            560.13 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,339      page-faults:u             #    2.391 K/sec                  
     1,898,772,574      cycles:u                  #    3.390 GHz                    
        39,751,378      stalled-cycles-frontend:u #    2.09% frontend cycles idle   
       677,732,625      stalled-cycles-backend:u  #   35.69% backend cycles idle    
     2,385,056,239      instructions:u            #    1.26  insn per cycle         
                                                  #    0.28  stalled cycles per insn
       439,507,344      branches:u                #  784.657 M/sec                  
        13,799,727      branch-misses:u           #    3.14% of all branches        

       0.560540216 seconds time elapsed

       0.526526000 seconds user
       0.033323000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.48s  user 0.03s system 99% cpu 0.511 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                391 KB
page faults from disk:     0
other page faults:         1404

@QuarticCat
Contributor Author

Compress seen map.

Here hashbrown is introduced since it has a get_key_value_mut method.

Speed: 207%

Memory: 68%

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     540.0 ms ±  10.0 ms    [User: 509.0 ms, System: 30.0 ms]
  Range (min … max):   529.4 ms … 557.1 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            520.83 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,334      page-faults:u             #    2.561 K/sec                  
     1,761,757,697      cycles:u                  #    3.383 GHz                    
        35,457,864      stalled-cycles-frontend:u #    2.01% frontend cycles idle   
       601,687,452      stalled-cycles-backend:u  #   34.15% backend cycles idle    
     2,244,942,485      instructions:u            #    1.27  insn per cycle         
                                                  #    0.27  stalled cycles per insn
       418,526,690      branches:u                #  803.569 M/sec                  
        13,789,606      branch-misses:u           #    3.29% of all branches        

       0.521212130 seconds time elapsed

       0.483899000 seconds user
       0.036709000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.43s  user 0.04s system 99% cpu 0.469 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                375 KB
page faults from disk:     0
other page faults:         1396

@QuarticCat
Contributor Author

Skip visited vertices.

Speed: 212%

Memory: 68%
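
The general idea of a visited flag on graph nodes can be sketched as follows. This is a textbook Dijkstra over a hypothetical `Node` type, not difftastic's actual Vertex or search code. A node may sit in the heap several times with different costs, and the flag lets stale entries be dropped as soon as they are popped.

```rust
use std::cell::Cell;
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical graph node; difftastic's real vertex type differs.
struct Node {
    /// Outgoing edges as (target index, cost).
    edges: Vec<(usize, u64)>,
    /// Set when the node is first popped, i.e. when its distance is final.
    visited: Cell<bool>,
}

fn node(edges: &[(usize, u64)]) -> Node {
    Node { edges: edges.to_vec(), visited: Cell::new(false) }
}

/// Returns the cheapest cost from `start` to `goal`, or `None` if unreachable.
/// The `visited` flags must all be false when this is called.
fn shortest_path_cost(nodes: &[Node], start: usize, goal: usize) -> Option<u64> {
    // `Reverse` turns the max-heap into a min-heap ordered by cost.
    let mut heap = BinaryHeap::new();
    heap.push(Reverse((0u64, start)));

    while let Some(Reverse((cost, idx))) = heap.pop() {
        let current = &nodes[idx];
        // `replace` returns the old flag value: true means this entry is stale.
        if current.visited.replace(true) {
            continue;
        }
        if idx == goal {
            return Some(cost);
        }
        for &(next, weight) in &current.edges {
            // Finished nodes cannot be improved, so don't push them again.
            if !nodes[next].visited.get() {
                heap.push(Reverse((cost + weight, next)));
            }
        }
    }
    None
}
```

Storing the flag on the node avoids a separate lookup in a set or map for every popped entry.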

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     529.0 ms ±   6.1 ms    [User: 494.6 ms, System: 33.7 ms]
  Range (min … max):   520.8 ms … 537.4 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            523.72 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,331      page-faults:u             #    2.541 K/sec                  
     1,774,187,486      cycles:u                  #    3.388 GHz                    
        44,683,168      stalled-cycles-frontend:u #    2.52% frontend cycles idle   
       588,601,309      stalled-cycles-backend:u  #   33.18% backend cycles idle    
     2,238,969,768      instructions:u            #    1.26  insn per cycle         
                                                  #    0.26  stalled cycles per insn
       419,261,132      branches:u                #  800.545 M/sec                  
        13,633,920      branch-misses:u           #    3.25% of all branches        

       0.523955753 seconds time elapsed

       0.490112000 seconds user
       0.033338000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.44s  user 0.02s system 99% cpu 0.465 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                375 KB
page faults from disk:     0
other page faults:         1398

@QuarticCat
Contributor Author

Refactor shortest path algorithm. This commit also removes some unsafe code.

Speed: 222%

Memory: 41%

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     504.0 ms ±   6.5 ms    [User: 485.0 ms, System: 18.3 ms]
  Range (min … max):   498.3 ms … 519.4 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            496.49 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,257      page-faults:u             #    2.532 K/sec                  
     1,717,408,575      cycles:u                  #    3.459 GHz                    
        25,289,472      stalled-cycles-frontend:u #    1.47% frontend cycles idle   
       633,182,189      stalled-cycles-backend:u  #   36.87% backend cycles idle    
     2,244,678,474      instructions:u            #    1.31  insn per cycle         
                                                  #    0.28  stalled cycles per insn
       424,478,871      branches:u                #  854.952 M/sec                  
        13,665,200      branch-misses:u           #    3.22% of all branches        

       0.496761488 seconds time elapsed

       0.482851000 seconds user
       0.013317000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.42s  user 0.03s system 99% cpu 0.447 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                227 KB
page faults from disk:     0
other page faults:         1327

@QuarticCat
Contributor Author

Done. You can merge this PR now. If I have other optimizations I will open another PR.

The next big optimization opportunity might be parallelizing the for ... in possibly_changed loop. But it involves so many Cells that refactoring it feels exhausting.
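
One reason Cells get in the way here: `Cell<T>` is not `Sync`, so a value containing one cannot be shared by reference across threads. Below is a minimal sketch of one possible direction, using atomics and scoped threads. The `Section` type and the work done per item are hypothetical and far simpler than the real loop body.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical unit of work, standing in for one loop iteration.
struct Section {
    input: Vec<u32>,
    /// An atomic instead of a `Cell`, so `&Section` can cross threads.
    result: AtomicUsize,
}

impl Section {
    fn new(input: Vec<u32>) -> Self {
        Section { input, result: AtomicUsize::new(0) }
    }

    fn result(&self) -> usize {
        self.result.load(Ordering::Relaxed)
    }
}

/// Processes every section on its own scoped thread.
fn process_all(sections: &[Section]) {
    thread::scope(|s| {
        for section in sections {
            s.spawn(move || {
                // Placeholder work: count the odd inputs.
                let odd = section.input.iter().filter(|&&x| x % 2 == 1).count();
                section.result.store(odd, Ordering::Relaxed);
            });
        }
    });
    // `thread::scope` joins all spawned threads before returning.
}
```

One thread per item is only for brevity. A real version would more likely use a thread pool, and every Cell reachable from the shared data would need the same treatment.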

@Wilfred
Owner

Wilfred commented Oct 1, 2022

Wow, this is really cool. I'm not sure about the changes to SeenMap: I deliberately wanted a vec so I could easily experiment with different sizes. The rest looks good at first glance, and I think the use of a visited flag on graph nodes is particularly nice.

I'm a little busy at the moment, but I will do a proper merge and review as soon as I can :)

@QuarticCat QuarticCat mentioned this pull request Oct 3, 2022
@QuarticCat
Contributor Author

The third and maybe the last big performance improvement PR is ready. Given that you haven't merged this one yet, would you prefer that I add those changes to this PR?

@QuarticCat
Contributor Author

A small correction: I was using Zsh's built-in time command to measure memory usage, with the format set to

TIMEFMT="\
%J   %U  user %S system %P cpu %*E total
avg shared (code):         %X KB
avg unshared (data/stack): %D KB
total (sum):               %K KB
max memory:                %M KB
page faults from disk:     %F
other page faults:         %R"

According to the Zsh documentation, %M is reported in KB, but the values were actually in MB. Either way, that doesn't affect the percentages.

@Wilfred
Owner

Wilfred commented Oct 7, 2022

OK, I've cherry-picked the first three commits and I'll follow up on the rest when I can :)

If you have further awesome improvements, perhaps it would be clearer as a separate PR? I don't feel strongly though.

@QuarticCat
Contributor Author

Any update? If you have any questions, feel free to ask me. I'd be happy to explain my optimizations.
