

PERF: Use fast tuple access macros when the arg is clearly a tuple. #19746

Closed
wants to merge 1 commit

Conversation

anntzer (Contributor) commented Aug 24, 2021

(i.e. when the argument goes through PyTuple_Check{,Exact} or
PyTuple_Size (which calls PyTuple_Check) first, or is constructed using
Py_BuildValue; we don't assume any invariants across functions.)

This avoids having to call back into libpython; it's just a struct/array
access at a known offset instead.

It is however surprisingly difficult to actually benchmark such changes,
because e.g. gcc appears to use wildly different optimizations(?) with
and without the patch, so the benchmark deltas are all over the place
(from 1.5x to 0.5x...).
(See also #19735 (comment).)
So it's more a patch proposed "on general principles" rather than with
well-established benefits.

mattip (Member) commented Aug 24, 2021

The macros are part of the Py_LIMITED_API. If it cannot be demonstrated that these changes speed up larger benchmarks, I think this is moving in the wrong direction.

anntzer (Contributor, Author) commented Aug 24, 2021

I guess you meant "not part of the Py_LIMITED_API".
Note that there are already many uses of PyTuple_GET_ITEM/GET_SIZE throughout the numpy codebase; this patch only adds a few more...

mattip (Member) commented Aug 24, 2021

Yes, I meant "not part of", thanks.

mattip (Member) commented Aug 24, 2021

I wonder if doing the opposite - replacing the macros with the functions that are part of the limited API - would show a measurable slowdown.

anntzer (Contributor, Author) commented Aug 24, 2021

Sounds reasonable, but at least I haven't figured out how to get stable benchmarks :(

seberg (Member) commented Aug 24, 2021

@mattip how problematic are these really? I mean, a big chunk of the "limited API" is actually about the "stable ABI". But I am not sold that a stable ABI is an important (or realistic) aim for NumPy: we will probably always ship binaries for all supported Python versions.

Does using these macros make a real difference for HPy or PyPy? For smaller projects I agree, but for NumPy I wonder if there is much value in avoiding these "common" macros.

That said, my guess is that at most 3 of these instances may actually be measurable (most importantly the loop.c.src one, but that is probably practically unused, so it likely doesn't matter much anyway).

Sounds reasonable, but at least I haven't figured out how to get stable benchmarks :(

You have to time something specific to these, even with %timeit if you like. Although for most of the changes you won't be able to, or will be hard-pressed to, find something relevant.

mattip (Member) commented Aug 24, 2021

how problematic are these really?

They are not problematic. On the other hand, I think it is very difficult to show that using a macro instead of a function is impactful. I would be glad to be proved wrong. I think there is different refactoring we could do to speed up the code by avoiding the Python C-API unless we absolutely have to use it. For instance we could avoid using a tuple for the info argument entirely in PyUfunc_AddLoop.

Edit: which is not to say that anyone must spend time doing the refactoring.

anntzer (Contributor, Author) commented Aug 24, 2021

show that using a macro instead of a function is impactful

What I did see is that perf (https://en.wikipedia.org/wiki/Perf_(Linux)) showed a few percent of samples in calls to PyTuple_GetItem in the loadtxt benchmarks (from my current other series of PRs), and that's why it seemed reasonable to try to get rid of them. But as noted above and elsewhere, removing these doesn't actually improve the benchmarks.

seberg (Member) commented Sep 8, 2021

Should we close this? I don't care enough either way to go ahead. E.g. the first one and the one in VOID_setitem are ones I could imagine being measurable (I am not sure where NPY_TITLE_KEY_check is used). But most look like they just can't matter anyway (i.e. a singular overhead in a Python call, or there are calls like BuildValue nearby which are probably much slower anyway).

I mention the void setitem since it may be measurable if you create a new void array from a large number of tuples.

Not sure how certain the perf data is, although I guess that by inspecting it carefully you may be able to see whether it is indeed call overhead or the time it takes to read the tuple's contents from memory.

anntzer (Contributor, Author) commented Sep 8, 2021

Let's just close this for now.

@anntzer anntzer closed this Sep 8, 2021
@anntzer anntzer deleted the pytuple branch September 8, 2021 19:00

3 participants