Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hack traversals for speed #110

Merged
merged 7 commits into from Oct 1, 2022
Merged

Hack traversals for speed #110

merged 7 commits into from Oct 1, 2022

Conversation

inducer
Copy link
Owner

@inducer inducer commented Sep 23, 2022

There's actually a path towards merging this now.

  • It would need tests.
  • The mapper optimizer needs a way to inline the cache key, so that we don't have to hardcode it.
  • The mapper optimizer needs documentation. (Decided against this; it requires a fair bit of care to use, and the sharp edges may not be obvious to users. Added a comment.)

The root of this is this contest set up by @kaushikcfd to show that we should all abandon Python for Rust (or some other AOT statically-typed language).

Here's my fork of @kaushikcfd's gist: https://gist.github.com/inducer/25e3372fd5c82384d437c777ed4153e9

Latest results:

  • Starting point for me was 5.18s, which is @kaushikcfd's original with the print statements removed.
  • I'm currently at 1.6s. I am cheating and giving myself credit for the Py3.10 -> Py3.11 transition, but since this is all about changing languages, I guess fair is fair.

@inducer
Copy link
Owner Author

inducer commented Sep 23, 2022

Pypy is 0.90s.

@kaushikcfd
Copy link
Collaborator

For posterity, could also report the runtimes for the rust-port on your machine?

cargo new --bin symoxide_contest
cd symoxide_contest
cargo add symoxide
curl https://gist.githubusercontent.com/kaushikcfd/74c442a075557dad466cd3daea9c151f/raw/d593ce7d5a6de6764e541921472456c8528efb3b/symoxide_expr_traversal.rs > src/main.rs
cargo run --release

@inducer
Copy link
Owner Author

inducer commented Sep 23, 2022

Just ran this. Somewhere around .28s.

nightly-x86_64-unknown-linux-gnu (default)
rustc 1.66.0-nightly (432abd86f 2022-09-20)

@inducer
Copy link
Owner Author

inducer commented Sep 23, 2022

So pypy times seem super variable. Just reran, with

Python 3.8.13 (7.3.9+dfsg-4, Aug 09 2022, 12:51:24)
[PyPy 7.3.9 with GCC 12.1.0]

(nominally unchanged to before) and got 0.63s.

@inducer
Copy link
Owner Author

inducer commented Sep 23, 2022

I have an idea for a Python-side AST-to-AST transform that would eliminate the bounces to rec in the bulk of cases, by effectively inlining the common case of rec. IDK whether I'll be desperate enough this weekend to do it, but I think that might do something, because it would cut the number of frames set up in a pymbolic traversal by ~2.

@kaushikcfd
Copy link
Collaborator

So pypy times seem super variable. Just reran, with

Maybe earlier the JIT costs were also included? Also, not sure if tracing JIT is mature enough to compete with AOT compilers for such workloads in terms of both -- compilation overheads and quality of generated code.

@inducer
Copy link
Owner Author

inducer commented Sep 24, 2022

not sure if tracing JIT is mature enough to compete with AOT compilers

What does "mature" mean to you here? For me, I'm interested in correctness and speed, in that order. 🙂

@kaushikcfd
Copy link
Collaborator

What does "mature" mean to you here? For me, I'm interested in correctness and speed, in that order. slightly_smiling_face

Correctness is the bare minimum 1. I was mainly referring about the JIT overheads (#recompliations and compilation costs)

Footnotes

  1. not saying it is trivial :)

@inducer inducer force-pushed the need-for-speed branch 5 times, most recently from 8ac5423 to 472862e Compare September 25, 2022 18:01
@inducer
Copy link
Owner Author

inducer commented Sep 25, 2022

The "mapper optimizer" is now a thing. It succeeds at removing passed-through *args, **kwargs when those aren't needed, and that's clearly profitable.

It also contains a code path for inlining self.rec and/or the cache retrieval, but those are curiously unprofitable. Not sure I understand why, given that I thought we were bound by frame setup.

@inducer
Copy link
Owner Author

inducer commented Sep 25, 2022

The doc failure is sphinx-doc/sphinx#10861.

@kaushikcfd
Copy link
Collaborator

kaushikcfd commented Sep 26, 2022

Bumped the baseline by a bit, see kaushikcfd/symoxide#3 :).

@inducer inducer force-pushed the need-for-speed branch 2 times, most recently from 457ce74 to 79f21b0 Compare October 1, 2022 20:24
@inducer
Copy link
Owner Author

inducer commented Oct 1, 2022

The final tally here:

  • Py3.11 running the latest version takes 1.57s.
  • Py3.11 without the optimizer takes 2.76s.
  • Py3.10 with the optimizer takes 2.26s.
  • Py3.11 running the benchmark from "Add Kaushik's traversal benchmark" (which has a few improvements for which we shouldn't credit the infrastructure) on 3da6f85 (close to current main) takes 3.9s.
  • Py3.10 running the same thing takes 4.19s.
  • Pypy 7.3.9 (3.8) takes 0.57s, without taking advantage of the mapper optimizer.

There is a loopy failure, but that's just a doctest fail because this turns off flatten-by-default in the IdentityMapper.

@inducer inducer marked this pull request as ready for review October 1, 2022 20:36
@inducer inducer merged commit 3f301c5 into main Oct 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants