

PERF: Use fast tuple access macros when the arg is clearly a tuple. #19746

Closed
wants to merge 1 commit

Conversation

anntzer (Contributor) commented Aug 24, 2021

(i.e. when the argument goes through PyTuple_Check{,Exact} or
PyTuple_Size (which calls PyTuple_Check) first, or is constructed using
Py_BuildValue; we don't assume any invariants across functions.)

This avoids having to call back into libpython; it's just a struct/array
access at a known offset instead.

It is however surprisingly difficult to actually benchmark such changes,
because e.g. gcc appears to use wildly different optimizations(?) with
and without the patch, so the benchmark deltas are all over the place
(from 1.5x to 0.5x...).
(See also #19735 (comment).)
So it's more a patch proposed "on general principles" rather than with
well-established benefits.

mattip (Member) commented Aug 24, 2021

The macros are part of the Py_LIMITED_API. If it cannot be demonstrated that these changes speed up larger benchmarks, I think this is moving in the wrong direction.

anntzer (Contributor, Author) commented Aug 24, 2021

I guess you meant "not part of the Py_LIMITED_API".
Note that there are already many uses of PyTuple_GET_ITEM/GET_SIZE throughout the numpy codebase; this patch only adds a few more...

mattip (Member) commented Aug 24, 2021

Yes, I meant "not part of", thanks.

mattip (Member) commented Aug 24, 2021

I wonder if doing the opposite - replacing the macros with the functions that are part of the limited API - would show a measurable slowdown.

anntzer (Contributor, Author) commented Aug 24, 2021

Sounds reasonable, but at least I haven't figured out how to get stable benchmarks :(

seberg (Member) commented Aug 24, 2021

@mattip how problematic are these really? I mean, a big chunk of the "limited API" is actually about the "stable ABI". But I am not sold that a stable ABI is an important (or realistic) aim for NumPy: we will probably always ship binaries for all supported Python versions.

Does using these macros make a real difference for HPy or PyPy? For smaller projects I agree, but for NumPy I wonder if there is much value in avoiding these "common" macros.

That said, my guess is that at most 3 of these instances may actually be measurable (most importantly the loop.c.src one, but that is probably practically unused, so it likely doesn't matter much anyway).

Sounds reasonable, but at least I haven't figured out how to get stable benchmarks :(

You have to time something specific to these, even with %timeit if you like. Although for most of the changes you won't be able to, or will be hard-pressed to, find something relevant.

mattip (Member) commented Aug 24, 2021

how problematic are these really?

They are not problematic. On the other hand, I think it is very difficult to show that using a macro instead of a function is impactful. I would be glad to be proved wrong. I think there is different refactoring we could do to speed up the code by avoiding the Python C-API unless we absolutely have to use it. For instance we could avoid using a tuple for the info argument entirely in PyUfunc_AddLoop.

Edit: which is not to say that anyone must spend time doing the refactoring.

anntzer (Contributor, Author) commented Aug 24, 2021

show that using a macro instead of a function is impactful

What I did see is that perf (https://en.wikipedia.org/wiki/Perf_(Linux)) showed a few percent of samples in calls to PyTuple_GetItem in the loadtxt benchmarks (from my current other series of PRs), and that's why it seemed reasonable to try to get rid of them. But as noted above and elsewhere, removing these doesn't actually improve the benchmarks.

seberg (Member) commented Sep 8, 2021

Should we close this? I don't care enough either way to go ahead. E.g. the first one and the one in VOID_setitem are ones I could imagine being measurable (I am not sure where NPY_TITLE_KEY_check is used). But most look like they just can't matter anyway (i.e. a singular overhead in a Python call, or there are calls like BuildValue nearby which are probably much slower anyway).

I mention the void setitem since it may be measurable if you create a new void array from a large number of tuples.

Not sure how certain the perf data is, although I guess that by inspecting it carefully you may be able to see whether it is indeed call overhead or the time it takes to read the tuple's contents from memory.

anntzer (Contributor, Author) commented Sep 8, 2021

Let's just close this for now.

@anntzer anntzer closed this Sep 8, 2021
@anntzer anntzer deleted the pytuple branch September 8, 2021 19:00

3 participants