mycpp: only emit StackRoots if Collect() possbile #1963

melvinw · 2024-05-09T01:58:41Z

Most instances of StackRoot are unnecessary because they are created in functions where there is no danger of the garbage collector being invoked. This commit changes mycpp to construct a graph of function calls across the entire program and uses it to decide if StackRoots are necessary for a given function. An directed edge u -> v means u calls v. Mycpp builds up this graph during the declaration pass, and passes it to the definition pass, which uses it to search for a path from the body of a function to mylib.MaybeCollect, only emitting StackRoots if a path is found.

This reduces the number of stack roots by almost ~~10x~~ (EDIT: The new examples caught a bug. This is actually ~5x. The performance impact is about the same, though.) and has a noticeable impact on the wall time of workloads like the CPython configure script (see reports below. ~1-2% or 10-150ms). It's important to note, though that this optimization only considers the static call graph. Many of the remaining functions with StackRoots are functions in the CommandEvaluator that are mutually recursive. Reducing overhead from those will require a different approach.

stats from master

+ perf stat -r 20 -o perf-output.txt taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/testdata/osh-runtime/bin_true.sh 1000

 Performance counter stats for 'taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/testdata/osh-runtime/bin_true.sh 1000' (20 runs):

          1,019.89 msec task-clock                       #    0.969 CPUs utilized               ( +-  0.03% )
             1,138      context-switches                 #    1.116 K/sec                       ( +-  0.12% )
                 1      cpu-migrations                   #    0.980 /sec
           136,557      page-faults                      #  133.893 K/sec
     2,103,559,764      cycles                           #    2.063 GHz                         ( +-  0.95% )  (97.96%)
     1,086,119,441      stalled-cycles-frontend          #   51.63% frontend cycles idle        ( +-  1.38% )  (97.01%)
       591,005,976      stalled-cycles-backend           #   28.10% backend cycles idle         ( +-  2.17% )  (88.98%)
     1,793,734,183      instructions                     #    0.85  insn per cycle
                                                  #    0.61  stalled cycles per insn     ( +-  1.33% )  (98.37%)
       409,963,765      branches                         #  401.967 M/sec                       ( +-  1.01% )  (98.49%)
         9,663,067      branch-misses                    #    2.36% of all branches             ( +-  1.02% )  (98.04%)

          1.052296 +- 0.000366 seconds time elapsed  ( +-  0.03% )

+ perf stat -r 10 -o perf-output.txt taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/Python-2.7.13/configure

 Performance counter stats for 'taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/Python-2.7.13/configure' (10 runs):

         23,876.67 msec task-clock                       #    0.990 CPUs utilized               ( +-  0.01% )
            15,161      context-switches                 #  634.971 /sec                        ( +-  0.17% )
                 1      cpu-migrations                   #    0.042 /sec
         2,036,884      page-faults                      #   85.309 K/sec                       ( +-  0.00% )
    54,840,613,270      cycles                           #    2.297 GHz                         ( +-  0.09% )  (86.61%)
    26,062,053,505      stalled-cycles-frontend          #   47.52% frontend cycles idle        ( +-  0.14% )  (86.72%)
    16,536,908,235      stalled-cycles-backend           #   30.15% backend cycles idle         ( +-  0.18% )  (72.95%)
    57,104,919,153      instructions                     #    1.04  insn per cycle
                                                  #    0.46  stalled cycles per insn     ( +-  0.07% )  (86.88%)
    11,921,483,800      branches                         #  499.294 M/sec                       ( +-  0.05% )  (86.92%)
       336,030,711      branch-misses                    #    2.82% of all branches             ( +-  0.09% )  (86.38%)

          24.11207 +- 0.00243 seconds time elapsed  ( +-  0.01% )

stats from this branch

+ perf stat -r 20 -o perf-output.txt taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/testdata/osh-runtime/bin_true.sh 1000

 Performance counter stats for 'taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/testdata/osh-runtime/bin_true.sh 1000' (20 runs):

          1,002.17 msec task-clock                       #    0.969 CPUs utilized               ( +-  0.05% )
             1,131      context-switches                 #    1.129 K/sec                       ( +-  0.13% )
                 0      cpu-migrations                   #    0.000 /sec
           134,551      page-faults                      #  134.260 K/sec
     2,048,454,210      cycles                           #    2.044 GHz                         ( +-  1.16% )  (97.09%)
     1,056,078,784      stalled-cycles-frontend          #   51.55% frontend cycles idle        ( +-  1.62% )  (97.11%)
       566,911,785      stalled-cycles-backend           #   27.68% backend cycles idle         ( +-  2.48% )  (89.54%)
     1,711,132,647      instructions                     #    0.84  insn per cycle
                                                  #    0.62  stalled cycles per insn     ( +-  1.63% )  (94.09%)
       406,395,132      branches                         #  405.516 M/sec                       ( +-  1.13% )  (99.02%)
         9,807,102      branch-misses                    #    2.41% of all branches             ( +-  0.62% )  (98.15%)

          1.034549 +- 0.000544 seconds time elapsed  ( +-  0.05% )

+ perf stat -r 10 -o perf-output.txt taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/Python-2.7.13/configure

 Performance counter stats for 'taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/Python-2.7.13/configure' (10 runs):

         23,715.08 msec task-clock                       #    0.990 CPUs utilized               ( +-  0.02% )
            15,137      context-switches                 #  638.286 /sec                        ( +-  0.27% )
                 0      cpu-migrations                   #    0.000 /sec
         2,023,000      page-faults                      #   85.304 K/sec                       ( +-  0.00% )
    54,591,018,618      cycles                           #    2.302 GHz                         ( +-  0.06% )  (86.59%)
    25,806,949,417      stalled-cycles-frontend          #   47.27% frontend cycles idle        ( +-  0.13% )  (86.94%)
    16,316,561,268      stalled-cycles-backend           #   29.89% backend cycles idle         ( +-  0.20% )  (72.63%)
    56,691,411,513      instructions                     #    1.04  insn per cycle
                                                  #    0.46  stalled cycles per insn     ( +-  0.06% )  (87.01%)
    11,871,229,727      branches                         #  500.577 M/sec                       ( +-  0.04% )  (86.41%)
       335,151,243      branch-misses                    #    2.82% of all branches             ( +-  0.11% )  (86.52%)

          23.95348 +- 0.00476 seconds time elapsed  ( +-  0.02% )

melvinw · 2024-05-09T02:15:08Z

This also improves many of the cachegrind benchmarks by a good margin: http://travis-ci.oilshell.org/github-jobs/6915/benchmarks2.wwz/_tmp/gc-cachegrind/index.html

This reduces the number of stack roots by almost 10x.

andychu · 2024-05-10T00:56:35Z

Thanks for looking into this, it's pretty awesome that it makes us clearly faster than bash on several benchmarks/compute, in addition to Python configure speedups

http://travis-ci.oilshell.org/github-jobs/6921/benchmarks.wwz/_tmp/compute/index.html

Also it seems to be a pretty small amount of code

It seems like we clearly need something like this -- it's a major improvement.

And I'm glad you got past some of the "smells" like module aliases

I am going to poke at the tests a bit, I think there are some more cases we could add

Also I am dying to get rid of the self.decl conditionals, and they came up in this PR. But I remember I looked into it a month or 2 ago while I was doing some significant mycpp enhancements

And maybe it seemed too onerous

Since this adds some extra logic in the decl pass, it might be worth looking into again

But more important is if there are any cases where this could cause a runtime error, i.e. with new code patterns from contributors

I would also like to make a mycpp.asdl file to kinda "lock down" the dialect of typed Python we're using. And maybe change from this inverted "visitor" style to a plain old switch statement (although maybe that causes duplicate traversal logic)

That might be possible without a full "yaks" rewrite ...

It segfaults Let's see if it segfaults without this rooting optimization

andychu · 2024-05-10T01:31:55Z

Hm I started hacking on some more tests and was able to break it

I pulled the new tests into a branch based off master, and ran them, and they pass

https://github.com/oilshell/oil/tree/test-call-graph

I didn't look into the root cause, but it appears this change causes a problem. The test case is pretty simple -- it creates a big list, deletes some items, and then collects periodically in a loop

ASAN flags the error, so you can just run the test_root_call_graph by itself to see the error

I didn't add the virtual function case yet, so there may be more

I think we need something like this, but it's not urgent and we might want to take some time to make mycpp better (depending on what the bug is)

melvinw · 2024-05-10T01:44:45Z

Nice! I'll take a look at this over the weekend. I'm curious what's getting tickled

andychu · 2024-05-10T14:36:26Z

BTW this gets rid of about 400 KiB of executable code

After - http://travis-ci.oilshell.org/github-jobs/6929/cpp-coverage.wwz/_tmp/metrics/compare-gcc-clang.txt

-rwxr-xr-x 1 uke uke  2200704 May 10 01:09 _bin/clang-opt/oils-for-unix.stripped
-rwxr-xr-x 1 uke uke  1708304 May 10 01:09 _bin/cxx-opt/oils-for-unix.stripped

Before:

-rwxr-xr-x 1 uke uke  3146944 May 10 04:14 _bin/clang-opt/oils-for-unix.stripped
-rwxr-xr-x 1 uke uke  2126160 May 10 04:14 _bin/cxx-opt/oils-for-unix.stripped

Actually nearly 1 MB for Clang for some reason !! We have a somewhat old version of Clang I think

So I hope that we can do this correctly within the existing mycpp, with a few modifications

But if not it might be some motivation to rework mycpp a bit

melvinw · 2024-05-10T15:24:57Z

Oh right! I forgot that the stack roots contributed so much to the size of the binary. That's an awesome second result

melvinw · 2024-05-10T23:30:47Z

Aha! I was using the caller's name instead of the callee as the destination of the edge in the trivial case! Apparently we don't have many calls like this that lead to the GC in the spec. The fix added less than 100 roots.

melvinw · 2024-05-10T23:36:54Z

The linter failure from CI is confusing. It's complaining about an unused import io, but we clearly call io.StringIO() twice...

I don't see it locally either.

[I] melvin@zoidberg:~/workdir/oil$ test/lint.sh soil-run
-----
Linting Python 2 code
-----
asdl/front_end.py:42:30 list comprehension redefines '_' from line 11
asdl/gen_cpp.py:406:13 local variable 'variant_name' is assigned to but never used
builtin/func_hay.py:52:9 local variable 'parse_opts' is assigned to but never used
builtin/json_ysh.py:64:13 local variable 'arg_jw' is assigned to but never used
builtin/read_osh.py:304:13 local variable 'lexer' is assigned to but never used
core/executor.py:40:5 redefinition of unused 'state' from line 24
core/state.py:585:9 local variable 'success' is assigned to but never used
data_lang/j8.py:141:30 list comprehension redefines 'b' from line 136
osh/glob_.py:214:23 list comprehension redefines '_' from line 207
osh/glob_.py:370:17 local variable 'is_glob' is assigned to but never used
osh/split.py:80:34 list comprehension redefines 'buf' from line 64
osh/word_compile.py:202:13 local variable 'line_ended' is assigned to but never used
osh/word_eval.py:1047:13 local variable 'UP_val' is assigned to but never used
osh/word_parse.py:954:9 local variable 'right_token' is assigned to but never used
tools/ysh_ify.py:337:9 local variable 'op_id' is assigned to but never used
tools/ysh_ify.py:411:9 local variable 'has_rhs' is assigned to but never used
tools/ysh_ify.py:423:17 local variable 'lhs0' is assigned to but never used
tools/ysh_ify.py:430:13 local variable 'has_array_index' is assigned to but never used
tools/ysh_ify.py:678:17 local variable 'UP_iterable' is assigned to but never used
yaks/gen_cpp.py:6:1 redefinition of unused 'Int' from line 6
ysh/expr_eval.py:949:9 local variable 'UP_o' is assigned to but never used
-----
Linting Python 3 code
-----
mycpp/const_pass.py:334:17 local variable 'r' is assigned to but never used
mycpp/cppgen_pass.py:1027:17 local variable 'fmt_types' is assigned to but never used
mycpp/mycpp_main.py:177:5 local variable 'start_time' is assigned to but never used
mycpp/mylib.py:506:5 redefinition of unused 'File' from line 105
PASS: check-shebangs
[I] melvin@zoidberg:~/workdir/oil$ echo $?
0

andychu

OK glad it was a simple fix

Can you add a test case with a virtual function? That's what I was in the middle of, but I am reviewing some other PRs now

I believe that is realistic because say SubProgramThunk::Run has a CommandEvaluator and will use it

And it's a virtual function

Also we have virtual functions in vm::_Executor, WordEvaluator, etc.

I think the case we want to catch is that only one of the virtual functions creates a path to mylib.MaybeCollect(), but not all of them

andychu · 2024-05-11T02:10:26Z

mycpp/testpkg/module2.py

@@ -11,3 +12,7 @@ def func2():

  from testpkg import module1
  log(module1.CONST1)
+
+def func3():


Even in the test code I think it's better to be consistent about naming, e.g.

DoesCollect() rather than does_collect()

Func3() rather than func3()

Sometimes we follow MyPy naming but here there is no real reason to

andychu · 2024-05-11T02:18:35Z

Also I wonder if anything can be moved to mycpp/pass_state.py with UNIT tests in mycpp/pass_state_test.py

I added our virtual function heuristic there -- we need to add virtual to certain functions, but the Python source code doesn't have this concept.

So the Virtual state has some clear objects and methods, and we know what the LIMITATIONS of the heuristic are

The reason is mainly because I expect we need to clean up and most likely gradually rewrite mycpp. So having code outside of that big pass file is less coupled to the MyPy AST

(Also if you or anyone else is interested in writing a simple type checker, I think that is possible / in scope, now that we are 100% translated, and know all the idioms we use. This would let us get rid of the pinned version of MyPy

I may start a thread about "copying" the MyPy AST into mycpp.asdl as a possible first step / prototype)

andychu · 2024-05-11T03:12:23Z

Hm yeah we talked about callbacks a little, they are the same problem as virtual functions

e.g. you could have two different functions with signature int f(int, Str*), and one will call mylib.MaybeCollect(), and the other won't

However, I think we almost never use this. We almost always prefer a class with subclasses and then a virtual method like Run(). This is mainly because callbacks always need state that's not passed directly

I am not sure if we even translate callbacks correctly

But is something to think about and probably test. mycpp is still missing test coverage, because we just made it run one program. If we don't support callbacks then we could have mycpp/examples/invalid_* that shows that

Unfortunately right now I don't remember if we do or don't

xxx would be nice to minimize all the flipping between '.' and '::' as scope separators...

This commit adds a library for implementing a simple control flow graph. Right now it doesn't have any meaningful statement annotations, nor is it connected to anything in mycpp. Future commits will build on top of this to generate datalog programs to perform analyses like the call-graph analysis from #1963 or the alias or stack root analysis from #1972.

melvinw changed the base branch from master to soil-staging May 9, 2024 02:00

melvinw force-pushed the call-graph branch 2 times, most recently from 346cbf5 to ecff5d4 Compare May 9, 2024 02:32

mycpp: only emit StackRoots if Collect() possbile

1b232b1

This reduces the number of stack roots by almost 10x.

melvinw force-pushed the call-graph branch from ecff5d4 to 1b232b1 Compare May 9, 2024 03:49

Merge branch 'master' into call-graph

e922a73

Andy Chu added 2 commits May 9, 2024 21:05

add JSON dump of call graph

f873402

Random hacking with some tests

f3f79c9

It segfaults Let's see if it segfaults without this rooting optimization

use full name when callee is a NameExpr

4a6d3ea

andychu reviewed May 11, 2024

View reviewed changes

melvinw closed this May 11, 2024

melvinw reopened this May 11, 2024

melvinw force-pushed the call-graph branch 3 times, most recently from d053201 to 2fb3400 Compare May 11, 2024 20:43

virtual method fix

7b87d6f

xxx would be nice to minimize all the flipping between '.' and '::' as scope separators...

melvinw force-pushed the call-graph branch from 2fb3400 to 7b87d6f Compare May 11, 2024 20:43

melvinw added 3 commits May 11, 2024 14:33

fixup! virtual method fix

afaa478

add pass_state.CallGraph and a simple test

97505c8

fixup! add pass_state.CallGraph and a simple test

6015e75

melvinw added 2 commits May 11, 2024 20:01

decouple callee resolution and graph construction

ddc6681

add path reconstruction, explain virtual approach

6e06ad7

melvinw mentioned this pull request May 25, 2024

mycpp: add ControlFlowGraph library #1975

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mycpp: only emit StackRoots if Collect() possbile #1963

mycpp: only emit StackRoots if Collect() possbile #1963

melvinw commented May 9, 2024 •

edited

melvinw commented May 9, 2024

andychu commented May 10, 2024 •

edited

andychu commented May 10, 2024

melvinw commented May 10, 2024

andychu commented May 10, 2024

melvinw commented May 10, 2024

melvinw commented May 10, 2024 •

edited

melvinw commented May 10, 2024

andychu left a comment

andychu May 11, 2024

andychu commented May 11, 2024

andychu commented May 11, 2024

mycpp: only emit StackRoots if Collect() possbile #1963

Are you sure you want to change the base?

mycpp: only emit StackRoots if Collect() possbile #1963

Conversation

melvinw commented May 9, 2024 • edited

melvinw commented May 9, 2024

andychu commented May 10, 2024 • edited

andychu commented May 10, 2024

melvinw commented May 10, 2024

andychu commented May 10, 2024

melvinw commented May 10, 2024

melvinw commented May 10, 2024 • edited

melvinw commented May 10, 2024

andychu left a comment

Choose a reason for hiding this comment

andychu May 11, 2024

Choose a reason for hiding this comment

andychu commented May 11, 2024

andychu commented May 11, 2024

melvinw commented May 9, 2024 •

edited

andychu commented May 10, 2024 •

edited

melvinw commented May 10, 2024 •

edited