PERF: Use fast tuple access macros when the arg is clearly a tuple. #19746
Conversation
(i.e. when the argument goes through PyTuple_Check{,Exact} or PyTuple_Size (which calls PyTuple_Check) first, or is constructed using Py_BuildValue; we don't assume any invariants across functions.) This avoids having to call back into libpython; it's just a struct/array access at a known offset instead. It is, however, surprisingly difficult to actually benchmark such changes, because e.g. gcc appears to use wildly different optimizations(?) with and without the patch, so the benchmark deltas are all over the place (from 1.5x to 0.5x...). (See also #19735 (comment).) So it's more a patch proposed "on general principles" rather than with well-established benefits.
The macros are part of the Py_LIMITED_API.
I guess you meant "not part of the Py_LIMITED_API".
Yes, I meant "not part of", thanks.
I wonder if doing the opposite - replacing macros with the functions that are part of the limited API - would show a measurable slowdown.
Sounds reasonable, but at least I haven't figured out how to get stable benchmarks :(
@mattip how problematic are these really? I mean, a big chunk of the "limited API" is actually about the "stable ABI". But I am not sold that a stable ABI is an important (or realistic) aim for NumPy: we will probably always have binaries for all supported Python versions. Does using these macros make a real difference for HPy or PyPy? For smaller projects I agree, but for NumPy I wonder if there is much value in avoiding these "common" macros. That said, my guess is at most 3 of these instances actually may be measurable (most importantly the …)
You have to time something specific to these, even with …
They are not problematic. On the other hand, I think it is very difficult to show that using a macro instead of a function is impactful. I would be glad to be proved wrong. I think there is different refactoring we could do to speed up the code by avoiding the Python C-API unless we absolutely have to use it. For instance, we could avoid using a tuple for the …
Edit: which is not to say that anyone must spend time doing the refactoring.
What I did see is that https://en.wikipedia.org/wiki/Perf_(Linux) showed a few percent of samples in calls to PyTuple_GetItem in the loadtxt benchmarks (from my current other series of PRs), and that's why it seemed reasonable to try to get rid of them. But as noted above and elsewhere, removing these doesn't actually improve the benchmarks.
Should we close this? I don't care enough either way to go ahead. E.g. the first one and the one in … I mention the void setitem since it may be measurable if you create a new void array from a large number of tuples. Not sure how certain the …
Let's just close this for now. |