Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mycpp: only emit StackRoots if Collect() possbile #1963

Open
wants to merge 11 commits into
base: soil-staging
Choose a base branch
from

Conversation

melvinw
Copy link
Collaborator

@melvinw melvinw commented May 9, 2024

Most instances of StackRoot are unnecessary because they are created in functions where there is no danger of the garbage collector being invoked. This commit changes mycpp to construct a graph of function calls across the entire program and uses it to decide if StackRoots are necessary for a given function. An directed edge u -> v means u calls v. Mycpp builds up this graph during the declaration pass, and passes it to the definition pass, which uses it to search for a path from the body of a function to mylib.MaybeCollect, only emitting StackRoots if a path is found.

This reduces the number of stack roots by almost 10x (EDIT: The new examples caught a bug. This is actually ~5x. The performance impact is about the same, though.) and has a noticeable impact on the wall time of workloads like the CPython configure script (see reports below. ~1-2% or 10-150ms). It's important to note, though that this optimization only considers the static call graph. Many of the remaining functions with StackRoots are functions in the CommandEvaluator that are mutually recursive. Reducing overhead from those will require a different approach.

stats from master

+ perf stat -r 20 -o perf-output.txt taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/testdata/osh-runtime/bin_true.sh 1000

 Performance counter stats for 'taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/testdata/osh-runtime/bin_true.sh 1000' (20 runs):

          1,019.89 msec task-clock                       #    0.969 CPUs utilized               ( +-  0.03% )
             1,138      context-switches                 #    1.116 K/sec                       ( +-  0.12% )
                 1      cpu-migrations                   #    0.980 /sec
           136,557      page-faults                      #  133.893 K/sec
     2,103,559,764      cycles                           #    2.063 GHz                         ( +-  0.95% )  (97.96%)
     1,086,119,441      stalled-cycles-frontend          #   51.63% frontend cycles idle        ( +-  1.38% )  (97.01%)
       591,005,976      stalled-cycles-backend           #   28.10% backend cycles idle         ( +-  2.17% )  (88.98%)
     1,793,734,183      instructions                     #    0.85  insn per cycle
                                                  #    0.61  stalled cycles per insn     ( +-  1.33% )  (98.37%)
       409,963,765      branches                         #  401.967 M/sec                       ( +-  1.01% )  (98.49%)
         9,663,067      branch-misses                    #    2.36% of all branches             ( +-  1.02% )  (98.04%)

          1.052296 +- 0.000366 seconds time elapsed  ( +-  0.03% )
+ perf stat -r 10 -o perf-output.txt taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/Python-2.7.13/configure

 Performance counter stats for 'taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/Python-2.7.13/configure' (10 runs):

         23,876.67 msec task-clock                       #    0.990 CPUs utilized               ( +-  0.01% )
            15,161      context-switches                 #  634.971 /sec                        ( +-  0.17% )
                 1      cpu-migrations                   #    0.042 /sec
         2,036,884      page-faults                      #   85.309 K/sec                       ( +-  0.00% )
    54,840,613,270      cycles                           #    2.297 GHz                         ( +-  0.09% )  (86.61%)
    26,062,053,505      stalled-cycles-frontend          #   47.52% frontend cycles idle        ( +-  0.14% )  (86.72%)
    16,536,908,235      stalled-cycles-backend           #   30.15% backend cycles idle         ( +-  0.18% )  (72.95%)
    57,104,919,153      instructions                     #    1.04  insn per cycle
                                                  #    0.46  stalled cycles per insn     ( +-  0.07% )  (86.88%)
    11,921,483,800      branches                         #  499.294 M/sec                       ( +-  0.05% )  (86.92%)
       336,030,711      branch-misses                    #    2.82% of all branches             ( +-  0.09% )  (86.38%)

          24.11207 +- 0.00243 seconds time elapsed  ( +-  0.01% )

stats from this branch

+ perf stat -r 20 -o perf-output.txt taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/testdata/osh-runtime/bin_true.sh 1000

 Performance counter stats for 'taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/testdata/osh-runtime/bin_true.sh 1000' (20 runs):

          1,002.17 msec task-clock                       #    0.969 CPUs utilized               ( +-  0.05% )
             1,131      context-switches                 #    1.129 K/sec                       ( +-  0.13% )
                 0      cpu-migrations                   #    0.000 /sec
           134,551      page-faults                      #  134.260 K/sec
     2,048,454,210      cycles                           #    2.044 GHz                         ( +-  1.16% )  (97.09%)
     1,056,078,784      stalled-cycles-frontend          #   51.55% frontend cycles idle        ( +-  1.62% )  (97.11%)
       566,911,785      stalled-cycles-backend           #   27.68% backend cycles idle         ( +-  2.48% )  (89.54%)
     1,711,132,647      instructions                     #    0.84  insn per cycle
                                                  #    0.62  stalled cycles per insn     ( +-  1.63% )  (94.09%)
       406,395,132      branches                         #  405.516 M/sec                       ( +-  1.13% )  (99.02%)
         9,807,102      branch-misses                    #    2.41% of all branches             ( +-  0.62% )  (98.15%)

          1.034549 +- 0.000544 seconds time elapsed  ( +-  0.05% )
+ perf stat -r 10 -o perf-output.txt taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/Python-2.7.13/configure

 Performance counter stats for 'taskset -a -c 2 /home/melvin/workdir/oil/_bin/cxx-opt/osh /home/melvin/workdir/oil/Python-2.7.13/configure' (10 runs):

         23,715.08 msec task-clock                       #    0.990 CPUs utilized               ( +-  0.02% )
            15,137      context-switches                 #  638.286 /sec                        ( +-  0.27% )
                 0      cpu-migrations                   #    0.000 /sec
         2,023,000      page-faults                      #   85.304 K/sec                       ( +-  0.00% )
    54,591,018,618      cycles                           #    2.302 GHz                         ( +-  0.06% )  (86.59%)
    25,806,949,417      stalled-cycles-frontend          #   47.27% frontend cycles idle        ( +-  0.13% )  (86.94%)
    16,316,561,268      stalled-cycles-backend           #   29.89% backend cycles idle         ( +-  0.20% )  (72.63%)
    56,691,411,513      instructions                     #    1.04  insn per cycle
                                                  #    0.46  stalled cycles per insn     ( +-  0.06% )  (87.01%)
    11,871,229,727      branches                         #  500.577 M/sec                       ( +-  0.04% )  (86.41%)
       335,151,243      branch-misses                    #    2.82% of all branches             ( +-  0.11% )  (86.52%)

          23.95348 +- 0.00476 seconds time elapsed  ( +-  0.02% )

@melvinw melvinw changed the base branch from master to soil-staging May 9, 2024 02:00
@melvinw
Copy link
Collaborator Author

melvinw commented May 9, 2024

This also improves many of the cachegrind benchmarks by a good margin: http://travis-ci.oilshell.org/github-jobs/6915/benchmarks2.wwz/_tmp/gc-cachegrind/index.html

@melvinw melvinw force-pushed the call-graph branch 2 times, most recently from 346cbf5 to ecff5d4 Compare May 9, 2024 02:32
This reduces the number of stack roots by almost 10x.
@andychu
Copy link
Contributor

andychu commented May 10, 2024

Thanks for looking into this, it's pretty awesome that it makes us clearly faster than bash on several benchmarks/compute, in addition to Python configure speedups

http://travis-ci.oilshell.org/github-jobs/6921/benchmarks.wwz/_tmp/compute/index.html

Also it seems to be a pretty small amount of code

It seems like we clearly need something like this -- it's a major improvement.

And I'm glad you got past some of the "smells" like module aliases


I am going to poke at the tests a bit, I think there are some more cases we could add

Also I am dying to get rid of the self.decl conditionals, and they came up in this PR. But I remember I looked into it a month or 2 ago while I was doing some significant mycpp enhancements

And maybe it seemed too onerous

Since this adds some extra logic in the decl pass, it might be worth looking into again

But more important is if there are any cases where this could cause a runtime error, i.e. with new code patterns from contributors

I would also like to make a mycpp.asdl file to kinda "lock down" the dialect of typed Python we're using. And maybe change from this inverted "visitor" style to a plain old switch statement (although maybe that causes duplicate traversal logic)

That might be possible without a full "yaks" rewrite ...

Andy Chu added 2 commits May 9, 2024 21:05
It segfaults

Let's see if it segfaults without this rooting optimization
@andychu
Copy link
Contributor

andychu commented May 10, 2024

Hm I started hacking on some more tests and was able to break it

I pulled the new tests into a branch based off master, and ran them, and they pass

https://github.com/oilshell/oil/tree/test-call-graph

I didn't look into the root cause, but it appears this change causes a problem. The test case is pretty simple -- it creates a big list, deletes some items, and then collects periodically in a loop

ASAN flags the error, so you can just run the test_root_call_graph by itself to see the error


I didn't add the virtual function case yet, so there may be more

I think we need something like this, but it's not urgent and we might want to take some time to make mycpp better (depending on what the bug is)

@melvinw
Copy link
Collaborator Author

melvinw commented May 10, 2024

Nice! I'll take a look at this over the weekend. I'm curious what's getting tickled

@andychu
Copy link
Contributor

andychu commented May 10, 2024

BTW this gets rid of about 400 KiB of executable code

After - http://travis-ci.oilshell.org/github-jobs/6929/cpp-coverage.wwz/_tmp/metrics/compare-gcc-clang.txt

-rwxr-xr-x 1 uke uke  2200704 May 10 01:09 _bin/clang-opt/oils-for-unix.stripped
-rwxr-xr-x 1 uke uke  1708304 May 10 01:09 _bin/cxx-opt/oils-for-unix.stripped

Before:

-rwxr-xr-x 1 uke uke  3146944 May 10 04:14 _bin/clang-opt/oils-for-unix.stripped
-rwxr-xr-x 1 uke uke  2126160 May 10 04:14 _bin/cxx-opt/oils-for-unix.stripped

Actually nearly 1 MB for Clang for some reason !! We have a somewhat old version of Clang I think


So I hope that we can do this correctly within the existing mycpp, with a few modifications

But if not it might be some motivation to rework mycpp a bit

@melvinw
Copy link
Collaborator Author

melvinw commented May 10, 2024

Oh right! I forgot that the stack roots contributed so much to the size of the binary. That's an awesome second result

@melvinw
Copy link
Collaborator Author

melvinw commented May 10, 2024

Aha! I was using the caller's name instead of the callee as the destination of the edge in the trivial case! Apparently we don't have many calls like this that lead to the GC in the spec. The fix added less than 100 roots.

@melvinw
Copy link
Collaborator Author

melvinw commented May 10, 2024

The linter failure from CI is confusing. It's complaining about an unused import io, but we clearly call io.StringIO() twice...

I don't see it locally either.

[I] melvin@zoidberg:~/workdir/oil$ test/lint.sh soil-run
-----
Linting Python 2 code
-----
asdl/front_end.py:42:30 list comprehension redefines '_' from line 11
asdl/gen_cpp.py:406:13 local variable 'variant_name' is assigned to but never used
builtin/func_hay.py:52:9 local variable 'parse_opts' is assigned to but never used
builtin/json_ysh.py:64:13 local variable 'arg_jw' is assigned to but never used
builtin/read_osh.py:304:13 local variable 'lexer' is assigned to but never used
core/executor.py:40:5 redefinition of unused 'state' from line 24
core/state.py:585:9 local variable 'success' is assigned to but never used
data_lang/j8.py:141:30 list comprehension redefines 'b' from line 136
osh/glob_.py:214:23 list comprehension redefines '_' from line 207
osh/glob_.py:370:17 local variable 'is_glob' is assigned to but never used
osh/split.py:80:34 list comprehension redefines 'buf' from line 64
osh/word_compile.py:202:13 local variable 'line_ended' is assigned to but never used
osh/word_eval.py:1047:13 local variable 'UP_val' is assigned to but never used
osh/word_parse.py:954:9 local variable 'right_token' is assigned to but never used
tools/ysh_ify.py:337:9 local variable 'op_id' is assigned to but never used
tools/ysh_ify.py:411:9 local variable 'has_rhs' is assigned to but never used
tools/ysh_ify.py:423:17 local variable 'lhs0' is assigned to but never used
tools/ysh_ify.py:430:13 local variable 'has_array_index' is assigned to but never used
tools/ysh_ify.py:678:17 local variable 'UP_iterable' is assigned to but never used
yaks/gen_cpp.py:6:1 redefinition of unused 'Int' from line 6
ysh/expr_eval.py:949:9 local variable 'UP_o' is assigned to but never used
-----
Linting Python 3 code
-----
mycpp/const_pass.py:334:17 local variable 'r' is assigned to but never used
mycpp/cppgen_pass.py:1027:17 local variable 'fmt_types' is assigned to but never used
mycpp/mycpp_main.py:177:5 local variable 'start_time' is assigned to but never used
mycpp/mylib.py:506:5 redefinition of unused 'File' from line 105
PASS: check-shebangs
[I] melvin@zoidberg:~/workdir/oil$ echo $?
0

Copy link
Contributor

@andychu andychu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK glad it was a simple fix

Can you add a test case with a virtual function? That's what I was in the middle of, but I am reviewing some other PRs now

I believe that is realistic because say SubProgramThunk::Run has a CommandEvaluator and will use it

And it's a virtual function

Also we have virtual functions in vm::_Executor, WordEvaluator, etc.

I think the case we want to catch is that only one of the virtual functions creates a path to mylib.MaybeCollect(), but not all of them

@@ -11,3 +12,7 @@ def func2():

from testpkg import module1
log(module1.CONST1)

def func3():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even in the test code I think it's better to be consistent about naming, e.g.

DoesCollect() rather than does_collect()

Func3() rather than func3()

Sometimes we follow MyPy naming but here there is no real reason to

@andychu
Copy link
Contributor

andychu commented May 11, 2024

Also I wonder if anything can be moved to mycpp/pass_state.py with UNIT tests in mycpp/pass_state_test.py

I added our virtual function heuristic there -- we need to add virtual to certain functions, but the Python source code doesn't have this concept.

So the Virtual state has some clear objects and methods, and we know what the LIMITATIONS of the heuristic are

The reason is mainly because I expect we need to clean up and most likely gradually rewrite mycpp. So having code outside of that big pass file is less coupled to the MyPy AST


(Also if you or anyone else is interested in writing a simple type checker, I think that is possible / in scope, now that we are 100% translated, and know all the idioms we use. This would let us get rid of the pinned version of MyPy

I may start a thread about "copying" the MyPy AST into mycpp.asdl as a possible first step / prototype)

@melvinw melvinw closed this May 11, 2024
@melvinw melvinw reopened this May 11, 2024
@andychu
Copy link
Contributor

andychu commented May 11, 2024

Hm yeah we talked about callbacks a little, they are the same problem as virtual functions

e.g. you could have two different functions with signature int f(int, Str*), and one will call mylib.MaybeCollect(), and the other won't

However, I think we almost never use this. We almost always prefer a class with subclasses and then a virtual method like Run(). This is mainly because callbacks always need state that's not passed directly

I am not sure if we even translate callbacks correctly

But is something to think about and probably test. mycpp is still missing test coverage, because we just made it run one program. If we don't support callbacks then we could have mycpp/examples/invalid_* that shows that

Unfortunately right now I don't remember if we do or don't

@melvinw melvinw force-pushed the call-graph branch 3 times, most recently from d053201 to 2fb3400 Compare May 11, 2024 20:43
xxx would be nice to minimize all the flipping between '.' and '::' as
scope separators...
melvinw added a commit that referenced this pull request May 25, 2024
This commit adds a library for implementing a simple control flow graph.
Right now it doesn't have any meaningful statement annotations, nor is
it connected to anything in mycpp. Future commits will build on top of
this to generate datalog programs to perform analyses like the
call-graph analysis from #1963 or the alias or stack root analysis
from #1972.
andychu pushed a commit that referenced this pull request May 25, 2024
This commit adds a library for implementing a simple control flow graph.
Right now it doesn't have any meaningful statement annotations, nor is
it connected to anything in mycpp. Future commits will build on top of
this to generate datalog programs to perform analyses like the
call-graph analysis from #1963 or the alias or stack root analysis
from #1972.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants