Return of the loop vectorization compiler issue using gcc 11.2.0 on Windows MSYS2 #6384

Open
Thanatomanic opened this issue Nov 22, 2021 · 58 comments
Labels
priority: critical Urgently needs fixing scope: compilation Compilation issues type: bug Something is not doing what it's supposed to be doing

Comments

@Thanatomanic
Contributor

I updated my MSYS2 installation today and lo and behold, the -fno-tree-loop-vectorize compiler bug is back. This time with a green color cast instead of magenta.

To reproduce
Fully update MSYS2, compile the RT dev branch, open any Bayer file, apply Neutral, toggle 'Auto-correction' for RAW Chromatic Aberration Correction. Voilà, green pixels.

AboutThisBuild.txt

Temporary solution
Building with -DCMAKE_CXX_FLAGS='-fno-tree-loop-vectorize' removes the issue.

Can anyone confirm first before we may need to head back to the people upstream?
@heckflosse

@Thanatomanic Thanatomanic added type: bug Something is not doing what it's supposed to be doing priority: critical Urgently needs fixing scope: compilation Compilation issues labels Nov 22, 2021
@heckflosse
Collaborator

I cannot confirm the green cast, because my MSYS2 gcc 11.2 RT builds crash as soon as I open a raw file :-(

@Thanatomanic
Contributor Author

I cannot reproduce any longer... not sure what was going on. Closing for now.

@heckflosse
Collaborator

@Thanatomanic

"I cannot reproduce any longer... not sure what was going on. Closing for now."

But I can now reproduce it when built with gcc 11.2.0 ...

Reopening...

@heckflosse heckflosse reopened this Dec 6, 2021
@heckflosse
Collaborator

heckflosse commented Dec 6, 2021

@Thanatomanic I'm already working on narrowing it down...

Edit: It indeed looks like the old gcc issue, but this time with doubles instead of floats. Maybe they fixed it only for single-precision floats... or they reintroduced the bug because their regression test only covered single-precision floats, who knows.

Anyway, I'm on it...

@heckflosse
Collaborator

@Thanatomanic Applying this silly patch solves the issue (not that we want to apply it) and clearly shows that there's an issue in gcc 11.2:

diff --git a/rtengine/gauss.cc b/rtengine/gauss.cc
index 99201a860..a27fc5a9c 100644
--- a/rtengine/gauss.cc
+++ b/rtengine/gauss.cc
@@ -19,7 +19,7 @@
 #include <cmath>
 #include <cstdlib>
 #include <cstring>
-
+#include <iostream>
 #include "gauss.h"

 #include "boxblur.h"
@@ -651,7 +651,13 @@ template<class T> void gaussHorizontal (T** src, T** dst, const int W, const int
         for (int j = 0; j < 3; j++) {
             M[i][j] /= (1.0 + b1 - b2 + b3) * (1.0 + b2 + (b1 - b3) * b3);
         }
-
+#pragma omp single
+{
+    for (int i = 0; i < 3; i++)
+        for (int j = 0; j < 3; j++) {
+            std::cout << M[i][j] << std::endl;
+        }
+}
     double temp2[W] ALIGNED16;

 #ifdef _OPENMP

@heckflosse
Collaborator

@Thanatomanic You need to enable raw CA correction with 'Avoid color shift' to trigger the bug.

@heckflosse
Collaborator

It's this loop that causes the issue. It is (wrongly) vectorized by gcc 11.2, but (correctly) not vectorized by gcc 10.3:

https://github.com/Beep6581/RawTherapee/blob/dev/rtengine/gauss.cc#L667

Here's a code snippet to check with godbolt.org...

#include <cmath>

template<class T> void calculateYvVFactors( const T sigma, T &b1, T &b2, T &b3, T &B, T M[3][3])
{
    // coefficient calculation
    T q;

    if (sigma < 2.5) {
        q = 3.97156 - 4.14554 * sqrt (1.0 - 0.26891 * sigma);
    } else {
        q = 0.98711 * sigma - 0.96330;
    }

    T b0 = 1.57825 + 2.44413 * q + 1.4281 * q * q + 0.422205 * q * q * q;
    b1 = 2.44413 * q + 2.85619 * q * q + 1.26661 * q * q * q;
    b2 = -1.4281 * q * q - 1.26661 * q * q * q;
    b3 = 0.422205 * q * q * q;
    B = 1.0 - (b1 + b2 + b3) / b0;

    b1 /= b0;
    b2 /= b0;
    b3 /= b0;

    // From: Bill Triggs, Michael Sdika: Boundary Conditions for Young-van Vliet Recursive Filtering
    M[0][0] = -b3 * b1 + 1.0 - b3 * b3 - b2;
    M[0][1] = (b3 + b1) * (b2 + b3 * b1);
    M[0][2] = b3 * (b1 + b3 * b2);
    M[1][0] = b1 + b3 * b2;
    M[1][1] = -(b2 - 1.0) * (b2 + b3 * b1);
    M[1][2] = -(b3 * b1 + b3 * b3 + b2 - 1.0) * b3;
    M[2][0] = b3 * b1 + b2 + b1 * b1 - b2 * b2;
    M[2][1] = b1 * b2 + b3 * b2 * b2 - b1 * b3 * b3 - b3 * b3 * b3 - b3 * b2 + b3;
    M[2][2] = b3 * (b1 + b3 * b2);

}

template<class T> void gaussHorizontal (const T* const* src, T** dst, const int W, const int H, const double sigma)
{
    
    double b1, b2, b3, B, M[3][3];
    calculateYvVFactors<double>(sigma, b1, b2, b3, B, M);

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            M[i][j] /= (1.0 + b1 - b2 + b3) * (1.0 + b2 + (b1 - b3) * b3);
        }

    double temp2[W] ;

#ifdef _OPENMP
    #pragma omp for
#endif

    for (int i = 0; i < H; i++) {

        temp2[0] = B * src[i][0] + b1 * src[i][0] + b2 * src[i][0] + b3 * src[i][0];
        temp2[1] = B * src[i][1] + b1 * temp2[0]  + b2 * src[i][0] + b3 * src[i][0];
        temp2[2] = B * src[i][2] + b1 * temp2[1]  + b2 * temp2[0]  + b3 * src[i][0];

        for (int j = 3; j < W; j++) {
            temp2[j] = B * src[i][j] + b1 * temp2[j - 1] + b2 * temp2[j - 2] + b3 * temp2[j - 3];
        }

        double temp2Wm1 = src[i][W - 1] + M[0][0] * (temp2[W - 1] - src[i][W - 1]) + M[0][1] * (temp2[W - 2] - src[i][W - 1]) + M[0][2] * (temp2[W - 3] - src[i][W - 1]);
        double temp2W   = src[i][W - 1] + M[1][0] * (temp2[W - 1] - src[i][W - 1]) + M[1][1] * (temp2[W - 2] - src[i][W - 1]) + M[1][2] * (temp2[W - 3] - src[i][W - 1]);
        double temp2Wp1 = src[i][W - 1] + M[2][0] * (temp2[W - 1] - src[i][W - 1]) + M[2][1] * (temp2[W - 2] - src[i][W - 1]) + M[2][2] * (temp2[W - 3] - src[i][W - 1]);

        temp2[W - 1] = temp2Wm1;
        temp2[W - 2] = B * temp2[W - 2] + b1 * temp2[W - 1] + b2 * temp2W + b3 * temp2Wp1;
        temp2[W - 3] = B * temp2[W - 3] + b1 * temp2[W - 2] + b2 * temp2[W - 1] + b3 * temp2W;

        for (int j = W - 4; j >= 0; j--) {
            temp2[j] = B * temp2[j] + b1 * temp2[j + 1] + b2 * temp2[j + 2] + b3 * temp2[j + 3];
        }

        for (int j = 0; j < W; j++) {
            dst[i][j] = (T)temp2[j];
        }

    }
}

void test(float** src, float** dst, const int W, const int H, const double sigma) {
    gaussHorizontal (src, dst, W, H, sigma);
}
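
To check whether a single function is affected without changing the global build flags, GCC's `optimize` function attribute can switch the loop vectorizer off for just that function. The sketch below is only an illustration: `forwardPass` and `NO_LOOP_VECTORIZE` are hypothetical names, the function is a simplified stand-in for the forward loop in the snippet above, and the GCC manual describes this attribute as a debugging aid rather than something for production code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// GCC-specific: compile the annotated function as if
// -fno-tree-loop-vectorize had been passed. Other compilers get nothing.
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_VECTORIZE __attribute__((optimize("no-tree-loop-vectorize")))
#else
#define NO_LOOP_VECTORIZE
#endif

// Hypothetical stand-in for the forward pass of gaussHorizontal:
// a third-order recursive filter over one image row.
NO_LOOP_VECTORIZE
std::vector<double> forwardPass(const std::vector<float>& src,
                                double B, double b1, double b2, double b3)
{
    assert(src.size() >= 3);
    const std::size_t W = src.size();
    std::vector<double> out(W);

    // Left boundary: missing history is replaced by the first sample.
    out[0] = B * src[0] + b1 * src[0] + b2 * src[0] + b3 * src[0];
    out[1] = B * src[1] + b1 * out[0] + b2 * src[0] + b3 * src[0];
    out[2] = B * src[2] + b1 * out[1] + b2 * out[0] + b3 * src[0];

    // Loop-carried dependency: each element needs the three before it.
    for (std::size_t j = 3; j < W; j++) {
        out[j] = B * src[j] + b1 * out[j - 1] + b2 * out[j - 2] + b3 * out[j - 3];
    }

    return out;
}
```

Since B + b1 + b2 + b3 equals 1 after the normalization in calculateYvVFactors, a constant input row should come out unchanged, which gives a quick sanity check for any build.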

@Thanatomanic
Contributor Author

@heckflosse I've temporarily patched CMakeLists to force the vectorization to be off. Would you be interested in taking this upstream again?

@heckflosse
Collaborator

@Thanatomanic please push

Desmis added a commit that referenced this issue Dec 21, 2021
* (Temporarily) disable `ftree-loop-vectorize` for GCC 11 because of #6384

Co-authored-by: Thanatomanic <6567747+Thanatomanic@users.noreply.github.com>
Co-authored-by: Anna <simonanna@gmx.net>
@Lawrence37
Collaborator

@Thanatomanic I see you've disabled the vectorization for gcc 11. Did you observe the same problem with earlier versions of gcc 11, not just 11.2?

@Thanatomanic
Contributor Author

@Lawrence37 I suspect earlier versions of the 11 branch have never been publicly available for MSYS2. At least, they're not in the public repo: http://repo.msys2.org/mingw/x86_64/
And judging from the output of Godbolt with the code Ingo provided above, I see no difference between 11.1 and 11.2.

@Thanatomanic
Contributor Author

MSYS2 updated to GCC 11.3.0 but the bug remains.

@Floessie
Collaborator

@Thanatomanic I've built RT without -fno-tree-loop-vectorize on a Debian Testing AMD64 with GCC 12.1 and followed your steps in the first post. Seems like the bug is gone. 🎉

@Lawrence37
Collaborator

I was hoping to test if GCC 12.1.0 fixes the issue, but I'm not able to reproduce the bug on GCC 11.2.0/11.3.0 even with the same commit @Thanatomanic used.

AboutThisBuild.txt
Version: 5.8-3022-gb1e7860a2
Branch: 5.8-3022-gb1e7860a2
Commit: b1e7860a2
Commit date: 2021-08-09
Compiler: cc 11.2.0
Processor: undefined
System: Windows
Bit depth: 64 bits
Gtkmm: V3.24.6
Lensfun: V0.3.3.0
Build type: release
Build flags:  -std=c++11 -march=native -Werror=unused-label -Werror=delete-incomplete -fno-math-errno -Wno-attributes -Wall -Wuninitialized -Wcast-qual -Wno-deprecated-declarations -Wno-unused-result -Wunused-macros -fopenmp -Werror=unknown-pragmas -O3 -DNDEBUG -ftree-vectorize
Link flags:  -march=native
OpenMP support: ON
MMAP support: ON
Build OS: MINGW64_NT-10.0-19042 3.3.5-341.x86_64 x86_64
Build date: Sat, 11 Jun 2022 06:17:13 +0000 UTC
Build epoch: 1654928233
Build UUID: abee8c39-c6c5-4aad-b411-f404a9c6c29c

amsterdam.pef.pp3.txt

@Thanatomanic
Contributor Author

For me it is trivial to reproduce in a fully updated MSYS2 using GCC 12.1.0 and removing -fno-tree-loop-vectorize. Just apply Neutral and switch on Auto-correction for raw CA Correction. Boom, green stuff.

@Lawrence37
Collaborator

That's interesting. I upgraded to GCC 12.1.0 and now the image is green.

@Thanatomanic
Contributor Author

@ValZapod I just tested on Windows, but 12.2 still has the same issue.

I don't have the courage to file a bug report with the GCC devs...

@Lawrence37
Collaborator

Was anyone able to reproduce the vectorization bug with the code snippet provided by @heckflosse? I tried with different inputs but they all resulted in the expected output. I was hoping to use the snippet as a starting point for creating a minimum example for a bug report.

@Desmis
Collaborator

Desmis commented Sep 3, 2022

@Thanatomanic @Lawrence37

Hello

Excuse my questions, but MSYS2, git, etc. are all Greek to me. I scrupulously follow the instructions, but I don't understand (or I misunderstand) what I'm doing.

For health reasons (which unfortunately are still present) I haven't updated MSYS2 for a long time. I'm on gcc 10.3.

I see there are problems with gcc versions 11 and 12.

Can I update MSYS2 using pacman -Syu or pacman -Syuu?

Or should I wait?

I see this in CMakeLists.txt:

if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU" AND ((CMAKE_CXX_COMPILER_VERSION VERSION_GREATER "10.0" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS "10.2") OR (CMAKE_CXX_COMPILER_VERSION VERSION_GREATER_EQUAL "11.0")))
    message(STATUS "WARNING: gcc ${CMAKE_CXX_COMPILER_VERSION} is known to miscompile RawTherapee when using -ftree-loop-vectorize, forcing the option to be off")
    set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fno-tree-loop-vectorize")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-tree-loop-vectorize")
endif()

In case it doesn't work (I'm on Windows 8 and my machine is very old, 11 years; I was supposed to replace it at the beginning of 2022, but...), what are the commands to enter, line by line, to do one of the following:

  • uninstall the 12.x version, which does not work?
  • install a working 11.y version?
  • or reinstall version 10.3, which works perfectly?

I will keep these instructions carefully for later use.

Thank you for this educational information.

Jacques

@Thanatomanic
Contributor Author

@Desmis It is safe to upgrade your MSYS2 environment and upgrade GCC to the latest version. The code in CMakeLists.txt prevents the bug from happening and affects processing speed only minimally (at least, I have not noticed any difference).

So, you can just open the MSYS2 MinGW 64-bit console and type pacman -Syu. Accept the changes, restart the console if required, and run the command again, until it says there are no more updates available.
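
For readers less familiar with CMake, the version test quoted above boils down to the predicate below. This is a plain C++ restatement for illustration only: `Version`, `versionLess` and `isAffectedGcc` are hypothetical names that do not exist in RawTherapee, and the actual check lives in the CMake code.

```cpp
// Hypothetical helper type: a compiler version as (major, minor, patch).
// Omitted components count as zero, as in CMake's VERSION_* comparisons.
struct Version {
    int maj;
    int min;
    int patch;
};

// Component-wise "a < b", most significant component first.
constexpr bool versionLess(Version a, Version b)
{
    return a.maj != b.maj ? a.maj < b.maj
         : a.min != b.min ? a.min < b.min
         : a.patch < b.patch;
}

// Mirrors the quoted condition for a GNU compiler:
//   (version > 10.0 AND version < 10.2) OR version >= 11.0
constexpr bool isAffectedGcc(Version v)
{
    return (versionLess(Version{10, 0, 0}, v) && versionLess(v, Version{10, 2, 0}))
        || !versionLess(v, Version{11, 0, 0});
}

// The versions mentioned in this thread:
static_assert(!isAffectedGcc(Version{10, 3, 0}), "10.3 keeps vectorization on");
static_assert(isAffectedGcc(Version{11, 2, 0}), "11.2 gets the workaround");
static_assert(isAffectedGcc(Version{12, 2, 0}), "12.2 gets the workaround");
```

Because the second clause has no upper bound, every future GCC release gets the workaround until the condition is edited again.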

@Desmis
Collaborator

Desmis commented Sep 3, 2022

@Thanatomanic
Ok, thank you
I will change tomorrow
Jacques

@Thanatomanic
Contributor Author

"Was anyone able to reproduce the vectorization bug with the code snippet provided by @heckflosse? I tried with different inputs but they all resulted in the expected output."

@Lawrence37 Not explicitly, but it is fairly hard to test anyway because the code requires a src input of WxH floats, i.e. the entire image. To investigate (again) where the issue lies, I did the following:

// fast gaussian approximation if the support window is large
template<class T> void gaussHorizontal (T** src, T** dst, const int W, const int H, const double sigma)
{
    double b1, b2, b3, B, M[3][3];
    calculateYvVFactors<double>(sigma, b1, b2, b3, B, M);

    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            M[i][j] /= (1.0 + b1 - b2 + b3) * (1.0 + b2 + (b1 - b3) * b3);
        }
    }

    double temp2[W] ALIGNED16;
    
    printf("src=[%g %g %g %g %g\n",src[0][0],src[0][1],src[0][2],src[0][3],src[0][4]);
    printf("%g %g %g %g %g\n",src[1][0],src[1][1],src[1][2],src[1][3],src[1][4]);
    printf("%g %g %g %g %g\n",src[2][0],src[2][1],src[2][2],src[2][3],src[2][4]);
    printf("%g %g %g %g %g\n",src[3][0],src[3][1],src[3][2],src[3][3],src[3][4]);
    printf("%g %g %g %g %g]\n",src[4][0],src[4][1],src[4][2],src[4][3],src[4][4]);

#ifdef _OPENMP
    #pragma omp for
#endif

    for (int i = 0; i < H; i++) {

        temp2[0] = B * src[i][0] + b1 * src[i][0] + b2 * src[i][0] + b3 * src[i][0];
        temp2[1] = B * src[i][1] + b1 * temp2[0]  + b2 * src[i][0] + b3 * src[i][0];
        temp2[2] = B * src[i][2] + b1 * temp2[1]  + b2 * temp2[0]  + b3 * src[i][0];

        for (int j = 3; j < W; j++) {
            temp2[j] = B * src[i][j] + b1 * temp2[j - 1] + b2 * temp2[j - 2] + b3 * temp2[j - 3];
        }

        double temp2Wm1 = src[i][W - 1] + M[0][0] * (temp2[W - 1] - src[i][W - 1]) + M[0][1] * (temp2[W - 2] - src[i][W - 1]) + M[0][2] * (temp2[W - 3] - src[i][W - 1]);
        double temp2W   = src[i][W - 1] + M[1][0] * (temp2[W - 1] - src[i][W - 1]) + M[1][1] * (temp2[W - 2] - src[i][W - 1]) + M[1][2] * (temp2[W - 3] - src[i][W - 1]);
        double temp2Wp1 = src[i][W - 1] + M[2][0] * (temp2[W - 1] - src[i][W - 1]) + M[2][1] * (temp2[W - 2] - src[i][W - 1]) + M[2][2] * (temp2[W - 3] - src[i][W - 1]);

        temp2[W - 1] = temp2Wm1;
        temp2[W - 2] = B * temp2[W - 2] + b1 * temp2[W - 1] + b2 * temp2W + b3 * temp2Wp1;
        temp2[W - 3] = B * temp2[W - 3] + b1 * temp2[W - 2] + b2 * temp2[W - 1] + b3 * temp2W;

        for (int j = W - 4; j >= 0; j--) {
            temp2[j] = B * temp2[j] + b1 * temp2[j + 1] + b2 * temp2[j + 2] + b3 * temp2[j + 3];
        }

        for (int j = 0; j < W; j++) {
            dst[i][j] = (T)temp2[j];
        }

    }
    
    printf("dst=[%g %g %g %g %g\n",dst[0][0],dst[0][1],dst[0][2],dst[0][3],dst[0][4]);
    printf("%g %g %g %g %g\n",dst[1][0],dst[1][1],dst[1][2],dst[1][3],dst[1][4]);
    printf("%g %g %g %g %g\n",dst[2][0],dst[2][1],dst[2][2],dst[2][3],dst[2][4]);
    printf("%g %g %g %g %g\n",dst[3][0],dst[3][1],dst[3][2],dst[3][3],dst[3][4]);
    printf("%g %g %g %g %g]\n",dst[4][0],dst[4][1],dst[4][2],dst[4][3],dst[4][4]);
}

So, I simply check the input and output of both arrays. I then configure and compile RawTherapee as follows:
$ cmake -G Ninja -DLENSFUNDBDIR=share/lensfun/version_1 -DCMAKE_BUILD_TYPE=Release -DPROC_TARGET_NUMBER=2 -DCACHE_NAME_SUFFIX="5-dev" -DOPTION_OMP=OFF -DCMAKE_CXX_FLAGS_RELEASE="-O2 -DNDEBUG" .. && ninja install
So, no multithreading and optimized only with -O2. The bug does not show in this case; opening the well-known amsterdam.pef file with 'Neutral' and CA correction on gives (first lines):

src=[0.981189 0.977433 0.976948 0.96714 0.994864
0.996065 0.987361 0.974853 0.990726 1
1 1 0.997818 1 1
0.994804 0.987771 0.991381 0.985275 1
1 1 0.997599 1 1]
dst=[0.976641 0.976387 0.976124 0.975851 0.975569
0.9911 0.990885 0.990665 0.99044 0.990211
0.996854 0.996753 0.996652 0.996551 0.99645
0.994516 0.994489 0.99446 0.994431 0.994402
0.99579 0.995581 0.995366 0.995145 0.994918]
src=[1.00165 1.00649 1 1.00209 1.00736
1.00052 1 1 1.00604 1
1.00122 1.00323 1.00097 1.00241 1.00349
1 1.00083 1 1.00205 1.00532
1.00305 1 1 1.00352 1]
dst=[1.0003 1.00023 1.00015 1.00007 0.99999
1.00065 1.00066 1.00067 1.00068 1.00069
1.00031 1.00027 1.00022 1.00017 1.00012
1.00029 1.00028 1.00026 1.00024 1.00023
1.00194 1.00192 1.00191 1.0019 1.00189]
src=[0.994839 0.985878 0.99088 0.981108 1.00913
1.00965 1.00099 0.988478 1.00118 1.01434
1.01042 1.01103 1.01148 1.01387 1.01405
1.0063 0.991715 1.00467 0.994901 1.01256
1.01046 1.01296 1.01069 1.01128 1.00879]
dst=[0.991704 0.9915 0.991286 0.991064 0.990832
1.00613 1.00598 1.00582 1.00567 1.00551
1.01066 1.01069 1.01073 1.01076 1.0108
1.00808 1.00814 1.0082 1.00826 1.00832
1.00789 1.00772 1.00755 1.00737 1.00718]

Doing the same, but with -O3 clearly shows something goes wrong in accessing the arrays:

src=[0.981189 0.977433 0.976948 0.96714 0.994864
0.996065 0.987361 0.974853 0.990726 1
1 1 0.997818 1 1
0.994804 0.987771 0.991381 0.985275 1
1 1 0.997599 1 1]
dst=[nan nan nan nan nan
nan nan nan nan nan
nan nan nan nan nan
nan nan nan nan nan
nan nan nan nan nan]
src=[1.00165 1.00649 1 1.00209 1.00736
1.00052 1 1 1.00604 1
1.00122 1.00323 1.00097 1.00241 1.00349
1 1.00083 1 1.00205 1.00532
1.00305 1 1 1.00352 1]
dst=[nan nan nan nan nan
nan nan nan nan nan
nan nan nan nan nan
nan nan nan nan nan
nan nan nan nan nan]
src=[0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5]
dst=[0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5]
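
The repeated printf lines in the instrumented function above can be folded into a small helper, and the NaN output suggests also counting non-finite values so that a bad build is flagged without reading the dump by eye. A minimal sketch, assuming the same row-pointer layout as `src` and `dst`; `formatBlock` and `countNonFinite` are hypothetical names, not RawTherapee functions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <sstream>
#include <string>

// Format the top-left rows x cols corner of a row-pointer image,
// one image row per line, values separated by single spaces.
template<class T>
std::string formatBlock(const T* const* img, int rows, int cols)
{
    assert(img != nullptr && rows >= 0 && cols >= 0);
    std::ostringstream os;

    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            os << (j ? " " : "") << img[i][j];
        }
        os << '\n';
    }

    return os.str();
}

// Count the NaN or infinite entries in the same corner.
template<class T>
std::size_t countNonFinite(const T* const* img, int rows, int cols)
{
    assert(img != nullptr && rows >= 0 && cols >= 0);
    std::size_t bad = 0;

    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            if (!std::isfinite(img[i][j])) {
                bad++;
            }
        }
    }

    return bad;
}
```

Called as `countNonFinite(dst, 5, 5)` after the row loop, a non-zero count would correspond to the nan dumps shown above. Note that compilers may optimize `std::isfinite` away when finite-math-only options such as `-ffast-math` are in effect.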

@Desmis
Collaborator

Desmis commented Sep 4, 2022

@Thanatomanic

I ran pacman -Syu twice in MSYS2.
Everything seemed to work well: loading, verifying, etc.

But when I run cmake, I get this message:

jacques@pc-bureau MINGW64 /g/code/repo-rt/build
$ cmake -G "Ninja" -DLENSFUNDBDIR=share/lensfun -DCMAKE_BUILD_TYPE="release" -DPROC_TARGET_NUMBER="2" -DCACHE_NAME_SUFFIX="5-dev" ..
-- WARNING: gcc 12.2.0 is known to miscompile RawTherapee when using -ftree-loop-vectorize, forcing the option to be off
-- CMAKE_BUILD_TYPE: release
-- Configuring done
-- Generating done
-- Build files have been written to: G:/code/repo-rt/build

jacques@pc-bureau MINGW64 /g/code/repo-rt/build
$ cmake --build . --target install
ninja: error: 'C:/msys64/mingw64/lib/gcc/x86_64-w64-mingw32/10.3.0/libgomp.dll.a', needed by 'rtgui/rawtherapee.exe', missing and no known rule to make it

Can you help me?

Jacques

@Thanatomanic
Contributor Author

@Desmis Was your build directory empty before you started compilation? It seems it tries to find an old version of the file.

@Desmis
Collaborator

Desmis commented Sep 4, 2022

No, I will try with another build.

Jacques

@Desmis
Collaborator

Desmis commented Sep 4, 2022

@Thanatomanic
I just made a new clone of the repo.

It is compiling now...

I will check afterwards whether it runs well.

Thank you

Jacques

@Thanatomanic
Contributor Author

Some additional information on the bug: there seems to be an explosive loss of precision and propagation of errors while performing this loop at -O3:

        for (int j = 3; j < W; j++) {
            temp2[j] = B * src[i][j] + b1 * temp2[j - 1] + b2 * temp2[j - 2] + b3 * temp2[j - 3];
        }

Under -O2 conditions, I have, for example:

0.981189 0.981188 0.981186 0.981182 0.981177 0.981172 0.981165 0.981156 0.981145 0.981131
0.981113 0.981093 0.981069 0.98104 0.981007 0.980972 0.980936 0.980898 0.980857 0.980815
0.980769 0.980723 0.980678 0.980635 0.980592 0.980551 0.980512 0.980476 0.980442 0.980407
0.980372 0.980338 0.980306 0.980276 0.980247 0.980218 0.980189 0.98016 0.980129 0.980095
0.980058 0.98002 0.979979 0.979937 0.979892 0.979844 0.979794 0.979743 0.979688 0.97963
0.979565 0.979488 0.979391 0.979252 0.97903 0.978697 0.978265 0.977723 0.97707 0.976294
0.9754 0.974418 0.973394 0.972421 0.971537 0.970741 0.969987 0.969251 0.968543 0.967882
0.967292 0.966771 0.966322 0.965877 0.965389 0.964852 0.96426 0.963621 0.962931 0.962165
0.961295 0.960294 0.95916 0.957905 0.956575 0.955216 0.953847 0.952483 0.951131 0.949792 (...)

But when the bug appears, the numbers match only up to about the 40th index, and then it quickly goes from bad to worse:

0.981189 0.981188 0.981186 0.981182 0.981177 0.981172 0.981165 0.981156 0.981145 0.981131
0.981113 0.981093 0.981069 0.98104 0.981007 0.980972 0.980936 0.980898 0.980857 0.980815
0.980769 0.980723 0.980678 0.980635 0.980592 0.980551 0.980512 0.980476 0.980442 0.980407
0.980372 0.980338 0.980306 0.980276 0.980247 0.980218 0.980189 0.98016 0.980129 0.980095
0.980057 0.980019 0.979979 0.979938 0.979896 0.979844 0.979782 0.979715 0.979688 0.979697
0.979734 0.979488 0.978982 0.978223 0.97903 0.981184 0.984519 0.977723 0.961951 0.938278
0.9754 1.06632 1.20449 0.972421 0.412852 -0.434079 0.969987 4.36542 9.50824 0.967882
-19.6775 -50.9448 0.966322 126.463 316.529 0.964852 -761.913 -1917.3 0.962931 4638.38
11661.8 0.960294 -28189.2 -70883.7 0.956575 171365 430898 0.952483 -1.0417e+06 -2.61936e+06 (...)
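
The two dumps above can also be compared mechanically, which makes the onset of the divergence easier to pin down than reading the columns by eye. A minimal sketch, where `firstDivergence` is a hypothetical helper and the tolerance is an arbitrary choice for values printed with six significant digits:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Return the first index at which the two sequences differ by more than
// tol, or a.size() if they agree everywhere. NaN counts as a difference.
std::size_t firstDivergence(const std::vector<double>& a,
                            const std::vector<double>& b,
                            double tol)
{
    assert(a.size() == b.size());

    for (std::size_t i = 0; i < a.size(); i++) {
        const double d = std::fabs(a[i] - b[i]);

        if (!(d <= tol)) { // written this way so that NaN is reported too
            return i;
        }
    }

    return a.size();
}
```

Fed with the fifth printed row of each dump (indices 40 to 49) and a tolerance of 1e-5, this reports offset 6, i.e. index 46. In the rows shown, every third value from index 42 onwards (45, 48, 51, ...) is still identical in both dumps, which may be a useful hint for a bug report.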

@Desmis
Collaborator

Desmis commented Sep 4, 2022

@Thanatomanic

My system hangs here...
I tried 2 clones.

[266/300] Building CXX object rtgui/CMakeFiles/rth.dir/splash.cc.obj
[267/300] Building CXX object rtgui/CMakeFiles/rth.dir/softlight.cc.obj
[268/300] Building CXX object rtgui/CMakeFiles/rth.dir/threadutils.cc.obj
[269/300] Building CXX object rtgui/CMakeFiles/rth.dir/spot.cc.obj
[270/300] Building CXX object rtgui/CMakeFiles/rth.dir/thumbbrowserbase.cc.obj
[271/300] Building CXX object rtgui/CMakeFiles/rth.dir/thumbbrowserentrybase.cc.obj
[272/300] Building CXX object rtgui/CMakeFiles/rth.dir/thresholdselector.cc.obj
[273/300] Building CXX object rtgui/CMakeFiles/rth.dir/thumbimageupdater.cc.obj
[274/300] Building CXX object rtgui/CMakeFiles/rth.dir/thresholdadjuster.cc.obj
[275/300] Building CXX object rtgui/CMakeFiles/rth.dir/toolbar.cc.obj
[276/300] Building CXX object rtgui/CMakeFiles/rth.dir/tonecurve.cc.obj
[277/300] Building CXX object rtgui/CMakeFiles/rth.dir/thumbnail.cc.obj
[278/300] Building CXX object rtgui/CMakeFiles/rth.dir/toolpanel.cc.obj
[279/300] Building CXX object rtgui/CMakeFiles/rth.dir/vibrance.cc.obj
[280/300] Building CXX object rtgui/CMakeFiles/rth-cli.dir/alignedmalloc.cc.obj
[281/300] Building CXX object rtgui/CMakeFiles/rth.dir/whitebalance.cc.obj
[282/300] Building RC object rtgui/CMakeFiles/rth-cli.dir/myicon.rc.obj
[283/300] Building CXX object rtgui/CMakeFiles/rth.dir/vignetting.cc.obj
[284/300] Building CXX object rtgui/CMakeFiles/rth.dir/xtransprocess.cc.obj
[285/300] Building CXX object rtgui/CMakeFiles/rth.dir/toolpanelcoord.cc.obj
[286/300] Building CXX object rtgui/CMakeFiles/rth.dir/xtransrawexposure.cc.obj
[287/300] Building CXX object rtgui/CMakeFiles/rth-cli.dir/editcallbacks.cc.obj
[288/300] Building CXX object rtgui/CMakeFiles/rth.dir/zoompanel.cc.obj
[289/300] Building CXX object rtgui/CMakeFiles/rth-cli.dir/pathutils.cc.obj
[290/300] Building CXX object rtgui/CMakeFiles/rth-cli.dir/multilangmgr.cc.obj
[291/300] Building CXX object rtgui/CMakeFiles/rth-cli.dir/threadutils.cc.obj
[292/300] Building CXX object rtgui/

Jacques

@Desmis
Collaborator

Desmis commented Sep 4, 2022

@Thanatomanic
Excuse me, now it works.

My computer is very, very old, and it gets very hot...

I think everything works fine.

Thank you again

Jacques

@Thanatomanic
Contributor Author

It does not hang, just give it time. This is an unfortunate downside of the new GCC, see here: #6548

@Desmis
Collaborator

Desmis commented Sep 4, 2022

OK thank you

Jacques

@Thanatomanic
Contributor Author

@ZaquL It is not. There is a proposed solution, but it has not been fixed in the main branch.

@Lawrence37
Collaborator

Lawrence37 commented Sep 5, 2022

That was helpful @Thanatomanic. It turns out the -march=native flag is necessary to reproduce the bug. I narrowed it down to the -mfma flag, which enables fused multiply-add instructions. Compiling the code below with g++ -mfma -O3 and running it shows an incorrect result sampled from the middle of the 1D "image". The expected value is 0.905017, but it outputs -415762.

This is the simplest program I can make which shows the problem. Any other reduction yields correct results. I wish it could be reduced further, but if it's not possible, then I guess I'll use this example to open a bug report with the GCC folks.

/*
 *  This file is part of RawTherapee.
 *
 *  Copyright (c) 2004-2010 Gabor Horvath <hgabor@rawtherapee.com>
 *
 *  RawTherapee is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  RawTherapee is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with RawTherapee.  If not, see <https://www.gnu.org/licenses/>.
 */
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

template<class T> void calculateYvVFactors( const T sigma, T &b1, T &b2, T &b3, T &B, T M[3][3])
{
    // coefficient calculation
    T q;

    if (sigma < 2.5) {
        q = 3.97156 - 4.14554 * sqrt (1.0 - 0.26891 * sigma);
    } else {
        q = 0.98711 * sigma - 0.96330;
    }

    T b0 = 1.57825 + 2.44413 * q + 1.4281 * q * q + 0.422205 * q * q * q;
    b1 = 2.44413 * q + 2.85619 * q * q + 1.26661 * q * q * q;
    b2 = -1.4281 * q * q - 1.26661 * q * q * q;
    b3 = 0.422205 * q * q * q;
    B = 1.0 - (b1 + b2 + b3) / b0;

    b1 /= b0;
    b2 /= b0;
    b3 /= b0;

    for (int i = 0; i < 9; i++) {
        M[i/3][i%3] = 0;
    }
}

// fast gaussian approximation if the support window is large
template<class T> void gaussHorizontal (T* src, T* dst, const int W, const double sigma)
{
    double b1, b2, b3, B, M[3][3];
    calculateYvVFactors<double>(sigma, b1, b2, b3, B, M);

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            M[i][j] /= (1.0 + b1 - b2 + b3) * (1.0 + b2 + (b1 - b3) * b3);
        }

    double temp2[W];
    std::fill(temp2, temp2 + W, 0);

    for (int j = 3; j < W; j++) {
        // FIXME: Bug is here!
        temp2[j] = B * src[j] + b1 * temp2[j - 1] + b2 * temp2[j - 2] + b3 * temp2[j - 3];
    }

    for (int j = W - 4; j >= 0; j--) {
        // FIXME: and/or here!
        temp2[j] = B * temp2[j] + b1 * temp2[j + 1] + b2 * temp2[j + 2] + b3 * temp2[j + 3];
    }

    for (int j = 0; j < W; j++) {
        dst[j] = (T)temp2[j];
    }
}

template<class T> void gaussianBlurImpl(T* src, T* dst, const int W, const double sigma)
{
    gaussHorizontal<T> (src, dst, W, sigma);
    
    double b1, b2, b3, B, M[3][3];
    calculateYvVFactors<double>(sigma, b1, b2, b3, B, M);
}

void gaussianBlur(float* src, float* dst, const int W, const double sigma)
{
    gaussianBlurImpl<float>(src, dst, W, sigma);
}

int main() {
    constexpr int w = 100;
    std::vector<float> src_data(w, 1);
    std::vector<float> dst_data(w);

    gaussianBlur(src_data.data(), dst_data.data(), w, 30.0);

    std::cout << dst_data[w/2] << std::endl;

    return 0;
}

Edit: It does appear to be a loss of precision problem. Increasing the size of the "image" means more values need to be propagated which results in worse numerical results.

@Thanatomanic
Contributor Author

@Lawrence37 Thanks for the code. I can confirm that this returns correct values with both -O2 -mfma and -O3 -mfma on Godbolt until GCC 11.1. That seems to be the point where the compiler changed, for the worse for us.

However, now that we seem to have traced the origin, I would be surprised if this were actually the only place where such a loss of precision happens. I have the impression that code like this is all over the place... 🤔

@Lawrence37
Collaborator

Maybe the reason why we are not seeing the problem elsewhere is because these three conditions must be satisfied:

  1. The code must be a candidate for FMA instructions. It may even be the case that only certain FMA situations are buggy.
  2. The precision lost is compounded. Here, the result of one calculation is used as the input of the following one and so on in a chain.
  3. GCC optimizes the surrounding code in a specific way. In the example I provided, there is a lot of irrelevant code. Remove it and you'll see the program magically gets compiled correctly.

If we are confident that FMA is the problem, we should change the -fno-tree-loop-vectorize to -mno-fma.

@Lawrence37
Collaborator

Made a bug report here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902

@Lawrence37
Collaborator

@ValZapod I checked with a less trimmed-down version of the code and the bug remains. Hopefully the GCC people are able to find the cause using the sample I provided.

@Lawrence37
Collaborator

Lawrence37 commented Sep 12, 2022

It is not fixed. If they need the bug to be reproducible in trunk, then I will supply an updated example upon request. It's actually very easy to do. Just add back the H parameter in gaussHorizontal.

@Thanatomanic
Contributor Author

At least things are being looked at seriously, and there is even a recommendation on what to do to avoid the issue.
I can confirm that building with -ffp-contract=off avoids the bug. I have not noticed any significant loss of performance.

@Lawrence37
Collaborator

@ValZapod maybe I'm misunderstanding, but it seems clear to me that they believe it probably isn't fixed yet. See this comment regarding the bisected commits.

I believe both a are unrelated. The fix possibly caused a missed optimization
while the cause exposed some opportunity.

@Lawrence37
Collaborator

Perhaps we are talking about different bugs. The bug in the GCC-compiled example program is fixed. The bug I'm referring to is the one in GCC itself, which is not fixed.

@Lawrence37
Collaborator

There is no bug in the program

That's why I wrote "compiled", i.e. the binary executable.

I have not touched my computer for several days. That's one reason why I didn't upload the code. Besides, the GCC people have not asked for a sample that shows the bug in trunk, and I don't want to pollute the thread. If you want to check it, it's easy as I've said before. Just copy-paste the H parameter (and of course add a dummy value for the new argument).

@Thanatomanic
Contributor Author

@ValZapod The fact that the problem for the presented file is solved in trunk seems to me to be unrelated to the bug report. That is also what Richard Biener thinks. He claims the referenced commits are unrelated to the actual problem.

So, while the issue for this file seems magically fixed, it was unlikely due to Lawrence's report. Lawrence claims the bug is still there for an expanded testcase, which we will probably need to submit to the GCC people.

@Lawrence37
Collaborator

Back on my computer :)

I can check for you on trunk fast.

You have a newer version of GCC than the one on godbolt.org? Here's the example code with the H parameter added back which triggers the bug even with the trunk build of GCC available on godbolt.

/*
 *  This file is part of RawTherapee.
 *
 *  Copyright (c) 2004-2010 Gabor Horvath <hgabor@rawtherapee.com>
 *
 *  RawTherapee is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  RawTherapee is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with RawTherapee.  If not, see <https://www.gnu.org/licenses/>.
 */
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

template<class T> void calculateYvVFactors( const T sigma, T &b1, T &b2, T &b3, T &B, T M[3][3])
{
    // coefficient calculation
    T q;

    if (sigma < 2.5) {
        q = 3.97156 - 4.14554 * sqrt (1.0 - 0.26891 * sigma);
    } else {
        q = 0.98711 * sigma - 0.96330;
    }

    T b0 = 1.57825 + 2.44413 * q + 1.4281 * q * q + 0.422205 * q * q * q;
    b1 = 2.44413 * q + 2.85619 * q * q + 1.26661 * q * q * q;
    b2 = -1.4281 * q * q - 1.26661 * q * q * q;
    b3 = 0.422205 * q * q * q;
    B = 1.0 - (b1 + b2 + b3) / b0;

    b1 /= b0;
    b2 /= b0;
    b3 /= b0;

    for (int i = 0; i < 9; i++) {
        M[i/3][i%3] = 0;
    }
}

// fast gaussian approximation if the support window is large
template<class T> void gaussHorizontal (T* src, T* dst, const int W, const int H, const double sigma)
{
    double b1, b2, b3, B, M[3][3];
    calculateYvVFactors<double>(sigma, b1, b2, b3, B, M);

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            M[i][j] /= (1.0 + b1 - b2 + b3) * (1.0 + b2 + (b1 - b3) * b3);
        }

    double temp2[W];
    std::fill(temp2, temp2 + W, 0);

    for (int j = 3; j < W; j++) {
        // FIXME: Bug is here!
        temp2[j] = B * src[j] + b1 * temp2[j - 1] + b2 * temp2[j - 2] + b3 * temp2[j - 3];
    }

    for (int j = W - 4; j >= 0; j--) {
        // FIXME: and/or here!
        temp2[j] = B * temp2[j] + b1 * temp2[j + 1] + b2 * temp2[j + 2] + b3 * temp2[j + 3];
    }

    for (int j = 0; j < W; j++) {
        dst[j] = (T)temp2[j];
    }
}

template<class T> void gaussianBlurImpl(T* src, T* dst, const int W, const double sigma)
{
    gaussHorizontal<T> (src, dst, W, 1, sigma);
    
    double b1, b2, b3, B, M[3][3];
    calculateYvVFactors<double>(sigma, b1, b2, b3, B, M);
}

void gaussianBlur(float* src, float* dst, const int W, const double sigma)
{
    gaussianBlurImpl<float>(src, dst, W, sigma);
}

int main() {
    constexpr int w = 100;
    std::vector<float> src_data(w, 1);
    std::vector<float> dst_data(w);

    gaussianBlur(src_data.data(), dst_data.data(), w, 30.0);

    std::cout << dst_data[w/2] << std::endl;

    return 0;
}

Lawrence37 added a commit to Lawrence37/RawTherapee that referenced this issue Sep 18, 2022
Replace the previous workaround of setting -fno-tree-loop-vectorize for
a GCC optimization bug.

(Beep6581#6384)
Lawrence37 added a commit that referenced this issue Sep 18, 2022
Replace the previous workaround of setting -fno-tree-loop-vectorize for
a GCC optimization bug.

(#6384)
@Lawrence37
Collaborator

From the GCC man page:

-ffp-contract=off disables floating-point expression contraction.
-ffp-contract=fast enables floating-point expression contraction
such as forming of fused multiply-add operations if the target has
native support for them. -ffp-contract=on enables floating-point
expression contraction if allowed by the language standard. This
is currently not implemented and treated equal to
-ffp-contract=off.

On and off are currently the same, but off is more correct for disabling the problematic FMA.
