Contribute Pillow-SIMD back to Pillow #8

Open
homm opened this issue Jul 12, 2016 · 16 comments

Comments

@homm

homm commented Jul 12, 2016

Problems and solutions of merging SIMD code to the main codebase.

@homm

homm commented Jul 12, 2016

Compile-time decision

In general, the SIMD code will look like this:

#if defined(__SSE4__)
    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <mmintrin.h>   /* MMX intrinsics */
    #include <smmintrin.h>  /* SSE4.1 intrinsics */

    #if defined(__AVX2__)
        #include <immintrin.h>  /* AVX/AVX2 intrinsics */
    #endif
#endif

void
ImagingResampleSomething(UINT32 *lineOut, UINT32 *lineIn, int xmax)
{
    int x = 0;

#if defined(__AVX2__)
    for (; x < xmax - 7; x += 8) {
        // AVX2 code
    }
#endif

#if defined(__SSE4__)
    for (; x < xmax - 3; x += 4) {
        // SSE4 code
    }
#endif

    for (; x < xmax; x += 1) {
        // General x86 code
    }
}

If we compile the code with the -mavx2 flag, the general + SSE4 + AVX2 versions are generated; with -msse4, the general + SSE4 versions; without flags, only the general version is compiled.

We have to set compiler flags in setup.py, so we'll need access to CPUID from the setup. There is a PyCPUID package, but it doesn't support AVX at all.

Cons:

  • Pillow compiled on one machine will not run at full speed on another, or, worse, will contain illegal instructions.
  • There is no way to switch between SIMD versions at runtime, which makes unit testing harder and slower.
  • It is not clear how to detect CPUID at setup time.

@homm

homm commented Jul 12, 2016

Different .so modules

Suggested by @socketpair. There are three .so files compiled with different flags:

  • General version with no additional flags
  • General + SSE4 with -msse4
  • General + SSE4 + AVX2 with -mavx2

We detect CPUID and load the appropriate module from Python code.

Pros:

  • Easy switching between different versions (for unit tests or in case of problems with automatic detection)
  • Building with setuptools looks easy

Cons:

  • 3x compile time

@homm

homm commented Jul 12, 2016

Different functions in different .o files

Suggested by @toshic here (in Russian). We pass compiler flags to each .c file, not to the whole module, so we can compile several .c files into different .o files, or even the same file with different flags into different .o files, and link them all.

void
#if defined(__AVX2__)
    ImagingResampleSomething_avx2(UINT32 *lineOut, UINT32 *lineIn, int xmax)
#elif defined(__SSE4__)
    ImagingResampleSomething_sse4(UINT32 *lineOut, UINT32 *lineIn, int xmax)
#else
    ImagingResampleSomething_general(UINT32 *lineOut, UINT32 *lineIn, int xmax)
#endif
{
    // Same code as for compile-time decision 
}
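For completeness, here is a hedged sketch (not from the thread) of how the suffixed builds could then be selected at runtime through a function pointer, using GCC/Clang's __builtin_cpu_supports; the lazy-init dispatcher and the fallback order are assumptions for illustration only:

/* Hypothetical dispatcher, not part of Pillow-SIMD: picks one of the
   per-flag builds once and caches the choice. */
void ImagingResampleSomething_avx2(UINT32 *lineOut, UINT32 *lineIn, int xmax);
void ImagingResampleSomething_sse4(UINT32 *lineOut, UINT32 *lineIn, int xmax);
void ImagingResampleSomething_general(UINT32 *lineOut, UINT32 *lineIn, int xmax);

static void (*resample_impl)(UINT32 *, UINT32 *, int) = NULL;

void
ImagingResampleSomething(UINT32 *lineOut, UINT32 *lineIn, int xmax)
{
    if (resample_impl == NULL) {
        /* Query the CPU once; fall back from AVX2 to SSE4.1 to plain C. */
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx2"))
            resample_impl = ImagingResampleSomething_avx2;
        else if (__builtin_cpu_supports("sse4.1"))
            resample_impl = ImagingResampleSomething_sse4;
        else
            resample_impl = ImagingResampleSomething_general;
    }
    resample_impl(lineOut, lineIn, xmax);
}

(__builtin_cpu_supports needs GCC 4.8+ or a recent Clang; MSVC would need a different check.)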

Pros:

  • Easy switching between different versions (for unit tests or in case of problems with automatic detection)
  • Minor compilation time increase

Cons:

  • It's not clear how to do this with setuptools

@socketpair

socketpair commented Jul 12, 2016

I think that 3x compile time is not a problem at all nowadays. Simplicity is the key, especially when there is no performance impact.

Linking .o files compiled with different options may produce some nasty errors and AFAIK is not guaranteed to work. The compiler may use its own conventions for alignment and other internal details. Also, these options affect intrinsics like memcpy, so even non-specialized code will run faster when compiled with the appropriate options.

I vote for my variant :)

Also, it would be nice to have an API to force loading the .so with a specific optimisation level, say for testing how effective the optimisations are, or in case the automatic CPUID detection chooses the wrong version.

@cancan101

There was a similar conversation on the Tensorflow issue tracker with some suggested approaches: tensorflow/tensorflow#7257 (comment).

@akx

akx commented Jul 19, 2017

Would it be possible to package just the SIMD-accelerated routines, i.e. not duplicate all of Pillow in the fork?

I.e.

from PIL.Image import Image
from pillow_simd import resize
i = Image.open(...)
i = resize(i, (500, 500), ...)

@JGoutin

JGoutin commented Aug 19, 2017

I have the same problems with projects at work, so I am working on this to try to find an easy-to-use solution.

@grafi-tt

Hi, I'm interested in this issue and have implemented AVX2 / SSE4 runtime switching (grafi-tt@96f18c6).

Usage looks like this:

import PIL
from PIL import Image
import timeit

IMG_FILE = "image-1920x1200.png"

print(PIL.get_available_builds())
print(PIL.get_build())

img = Image.open(IMG_FILE)
result = timeit.timeit('img.resize((960, 600), resample=Image.BICUBIC)', globals=globals(), number=1000)
print(result)

PIL.set_build("SSE4")
img = Image.open(IMG_FILE)
result = timeit.timeit('img.resize((960, 600), resample=Image.BICUBIC)', globals=globals(), number=1000)
print(result)

PIL.set_build("AVX2")
img = Image.open(IMG_FILE)
result = timeit.timeit('img.resize((960, 600), resample=Image.BICUBIC)', globals=globals(), number=1000)
print(result)

PIL.set_build("SSE4")
img = Image.open(IMG_FILE)
result = timeit.timeit('img.resize((960, 600), resample=Image.BICUBIC)', globals=globals(), number=1000)
print(result)

and the following is its (possible) output.

['AVX2', 'SSE4', 'GENERIC']
AVX2
5.207448414999817
6.874838170999283
5.21245184000054
6.8702469989993915

Following the strategy @socketpair and @homm mentioned, I made the core imaging module (_imaging) build twice, appending distinct suffixes to the module names. Runtime CPU checking is done in a tiny C module that calls the cpuid instruction using inline assembly. I believe py-cpuinfo is overkill, especially for a module with few dependencies.
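(Not from the linked commit, just a sketch: the same check can be written without hand-rolled inline assembly by using the <cpuid.h> helpers shipped with GCC/Clang. A complete AVX2 check would additionally verify OS support for AVX state via XGETBV, which is omitted here.)

#include <cpuid.h>

/* SSE4.1 is reported in CPUID leaf 1, ECX bit 19. */
static int
cpu_has_sse4_1(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 19) & 1;
}

/* AVX2 is reported in CPUID leaf 7, sub-leaf 0, EBX bit 5. */
static int
cpu_has_avx2(void)
{
    unsigned int eax, ebx, ecx, edx;
    /* Make sure leaf 7 exists before querying it. */
    if (__get_cpuid_max(0, NULL) < 7)
        return 0;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return (ebx >> 5) & 1;
}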

A generic build (with neither AVX2 nor SSE4) is not available yet, as making one requires back-porting Pillow's non-SIMD code. I could probably manage to do this, but it is difficult as I'm new to Pillow and Pillow-SIMD; it should be straightforward for developers familiar with the Pillow-SIMD codebase.

If someone is interested, feel free to fork and improve my implementation; or if preferred, I'm happy to send a PR.

@JGoutin

JGoutin commented May 30, 2018

The project I linked below is complete and may be a solution to this problem.
I didn't try it with Pillow-SIMD, but it should be very easy to integrate with something like this:

# Top of "PIL/Image.py"
try:
    import compilertools
except ImportError:
    pass

# "setup.py"

# [...]

from setuptools import Extension, setup, find_packages

try:
    import compilertools.build
    compilertools.build.ConfigBuild.suffixes_includes = ['avx2', 'sse4']
except ImportError:
    pass

# [...]

setup(
        # [...]
        install_requires=['...', 'compilertools'],
        # [...]
        )

@adam-azarchs

The best way to solve the compile-time decision problem on recent-ish gcc versions would be to use the target attribute (see the gcc documentation) on functions. This causes the compiler to emit several versions of a function, each for a different architecture, with some magic to have the dynamic linker fix up the relocation tables based on run-time CPU detection at library load time. The problem is that this attribute isn't well supported on other compilers. clang, for example, should be getting support for that attribute in version 7, which will be out in September, hopefully.
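For plain C on GCC (6+) with glibc, the closest single-attribute form of this is the related target_clones attribute, which emits the per-architecture clones plus the load-time resolver automatically; a hedged sketch with an illustrative function (not a Pillow routine):

/* The compiler emits avx2, sse4.1 and baseline clones of this function,
   plus an ifunc resolver that the dynamic loader runs once. */
__attribute__((target_clones("avx2", "sse4.1", "default")))
void
scale_line(float *dst, const float *src, int n, float k)
{
    for (int i = 0; i < n; i++) {
        dst[i] = src[i] * k;
    }
}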

@grafi-tt

Well, using poor inline assembly code (like what I've written) is admittedly fragile. It surely causes tricky issues (e.g. https://stackoverflow.com/questions/12221646/how-to-call-cpuid-instruction-in-a-mac-framework).

Using the target attribute seems reasonable. Its support is actually not so recent (since GCC 4.8: https://gcc.gnu.org/wiki/FunctionMultiVersioning), and waiting for Clang's support is acceptable. Even ARM NEON support would be possible in the future :) The Intel C Compiler also has a similar feature, so someone interested could write a patch to support it.

Regarding manual detection, the problem is that there is no simple, portable, easy-to-use and stable open-source library for CPU detection (https://wiki.linaro.org/LEG/Engineering/OPTIM/Assembly#Hardware_identification). Using the __builtin_cpu_supports intrinsic is one possibility. It is supported in GCC 4.8.0 and Clang 3.9.0 or later (Clang 3.7 and 3.8 had a broken implementation: https://bugs.llvm.org/show_bug.cgi?id=25510). NumPy already uses this (https://github.com/numpy/numpy/blob/e0f884692f629d931d3673adc8dd6b0938cefa52/numpy/core/src/umath/cpuid.c). The Intel C Compiler again has a similar feature, and MSVC has a more primitive cpuid intrinsic, so supporting them is possible if desired.

@QuLogic

QuLogic commented Mar 25, 2019

Different functions in different .o files

If you want to go down this route, you might find some inspiration in pyfastnoisesimd. At build time, compiler support is queried here and the relevant files are enabled. At runtime, this function is used to determine support and the correct function is dispatched based on the results.

@farleylai

It is still not straightforward to have Pillow-SIMD installed through dependencies (e.g. pytorch can only list Pillow).
The README has a section "Why do not contribute SIMD to the original Pillow", but doesn't OpenCV have the same concern?
It seems OpenCV manages to distribute builds with AVX2 support through the anaconda repo via CPU dispatch code generation, so that rebuilding on the target platform is not necessary.
Is there any other difficulty preventing this kind of approach?

@adam-azarchs

As stated before, CPU dispatch is tricky to get right in a platform-independent way. OpenCV is a much larger project with more resources to maintain the build apparatus for doing such things properly, in particular doing dynamic dispatch in a way that doesn't severely impact runtime performance.

If you were only targeting gcc, then the target_clones attribute would basically solve the problem for you. Otherwise things get a bit more complicated.

Both clang and gcc support the target attribute for compiling a given method with a different architecture from the rest of the code. So you can kind of fake target_clones by making e.g. foo_avx and foo_sse and so on, having each of them call foo_base (also using the flatten attribute to make sure that method gets inlined into the arch-specific versions), and then using the ifunc attribute to set up the resolver for dynamic dispatch at load time. So it can be done, just with a lot of yak shaving. In terms of manually writing the resolver, https://github.com/google/cpu_features might be overkill but does solve that problem for you.
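A hedged sketch of that manual scheme (GCC/Clang with glibc; the names are illustrative, not Pillow API): arch-specific wrappers built with the target and flatten attributes, and an ifunc resolver that picks one at load time.

typedef void line_op_t(float *dst, const float *src, int n);

/* Shared scalar implementation; it gets inlined (and re-vectorized) into
   the arch-specific wrappers below thanks to the flatten attribute. */
static void
line_op_base(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++) {
        dst[i] = src[i] * 2.0f;
    }
}

__attribute__((target("avx2"), flatten))
static void
line_op_avx2(float *dst, const float *src, int n)
{
    line_op_base(dst, src, n);
}

__attribute__((target("sse4.1"), flatten))
static void
line_op_sse4(float *dst, const float *src, int n)
{
    line_op_base(dst, src, n);
}

/* Resolver: runs once at load time, before main(). */
static line_op_t *
resolve_line_op(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return line_op_avx2;
    if (__builtin_cpu_supports("sse4.1"))
        return line_op_sse4;
    return line_op_base;
}

/* Public entry point; the dynamic linker binds it through the resolver. */
void line_op(float *dst, const float *src, int n)
    __attribute__((ifunc("resolve_line_op")));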

@QuLogic

QuLogic commented Jan 29, 2022

NumPy now has a pretty comprehensive SIMD build and dispatch system, but I don't know how portable it would be to Pillow since it doesn't depend on NumPy.

@hameerabbasi

I'll just drop https://github.com/google/highway in here.
