Remove pre-C99 emulated float functions and require hardware FMA support on Windows #12519

dra27 · 2023-08-31T10:24:18Z

This is a proposal for #12513 for trunk. There are two intertwined facets to this PR, which mostly deletes (or moves) code.

Require hardware FMA on Windows

Cygwin has an intentionally "broken" implementation of fma (it is simply return x * y + z!). mingw-w64 is known to be broken as, unsurprisingly, is Visual Studio. Since Visual Studio 2019, where hardware support is available, hardware support is always used when available (in previous versions, the optimiser was capable of emitting FMA instructions, but never for the actual C99 function itself).

The hardware support in Visual Studio interacts badly with VirtualBox bug #15471, which for reasons known only to Oracle, masks out the FMA bit from the cpuid. However, it is only the masking which is incorrect - code using FMA instructions correctly executes. In terms of cpuid, it is just about possible to conflate AVX2 (which is reported in VirtualBox) with FMA, because they were released in the same architecture (indeed, the cl flag for forcing FMA is -arch:AVX2), but while no Intel or AMD chips exist which support AVX2 but not FMA, there are VIA chips which have AVX2 but not FMA.

However, we're talking about quite old CPUs here (pre-2013), affecting just Float.fma. Certified Windows hardware now requires Broadwell (2015) CPUs for both Windows 10 and Windows 11 (NB that's certification - Windows will physically run on anything from 2008's Nehalem). This PR therefore proposes a big simplification for Windows, by always synthesising caml_fma using vfmadd132sd. Since the cl option to force this also enables AVX2, it's necessary to put caml_fma in a C file of its own, or the entire runtime becomes unusable on pre-Haswell CPUs.

There then remains the question of what happens if you try to use Float.fma on Windows on an old CPU. @kit-ty-kate has been using an old laptop for Windows testing, and I took a trip to my loft and fished out an old 2012 laptop with an Intel i7-2760QM (AVX, no AVX2 or FMA, therefore) to have a look. If Float.fma is called, the program will by default abort with EXCEPTION_ILLEGAL_INSTRUCTION (0xc000001d). Cygwin should translate that to a more Unix-ly familiar SIGILL.

The VirtualBox bug means there's not much value in us doing anything with the cpuid (@jonahbeckford suggests this is one of the most common bug reports installing Windows OCaml for Diskuv).

I went down a little rabbit-hole to see if we could do something better in that situation. For the MSVC port, it's pretty straightforward to catch asynchronous exceptions via Structured Exception Handling (SEH) (using cl's __try and __except extensions), although entertainingly there seems to be a mis-compilation bug in Visual Studio 2019 (the code compiles correctly in Visual Studio 2022).

GCC doesn't implement the Microsoft extensions for SEH, but it's possible to use Vectored Exception Handling (VEH) instead. It's not pretty, though - it's a very similar trick to stack overflow detection in Windows in OCaml 4.x.

Since Float.fma is declared noalloc, we can't raise an exception so instead the function returns Float.nan on all inputs when the support is missing.

I've left the two commits in place for now, but I'm not convinced - especially as it's tricky to test. A possibility might be to include a helper function (similar to Unix.has_symlink) which could do the test much less invasively (that would also be much easier to test, because such a function could be easily tested with __ud2() so that the logic could be verified in CI).

Emulated C99 float operations; retiring `--enable-imprecise-c99-float-ops`

#944 added various C99 float operations. To build with old (but supported) versions of Visual Studio, configure has a --enable-imprecise-c99-float-ops option which turns on various emulations (this emulation is automatically enabled for Windows, but can be manually enabled on any platform). When the MSVC port of OCaml returns to trunk, it will necessarily require at least Visual Studio 2022, which has all of these functions. This PR therefore removes the emulations.

The mingw-w64 port has a long-known bug in its implementation of round. The work-around for this remains, but the detection of it is only used for mingw-w64 (pedantically, configuring with --disable-imprecise-c99-float-ops continues to cause a configuration error for mingw-w64, therefore). The enormous emulated version of caml_fma is gone, though.

The unboxed versions are now truly unboxed, which results in a visible renaming of the primitives - i.e. caml_cbrt is removed and Float now directly calls cbrt.

#8684 added caml_log1p to caml/misc.h outside of CAML_INTERNALS. This function no longer exists, so is removed - technically, that's a breaking API change, but I think that the addition of caml_log1p without the CAML_INTERNALS guard was a minor oversight in a much more complicated change!

shindere · 2023-09-11T15:34:15Z

Nothinglooks wrong to me but I am not an expert so this should be reviewed
also by somebody else.

For my part, the only remark I have is about the Makefile change done in the
first commit. How about introducing a configured FMA_CFLAGS build
varyable? I feel it would be more principled than the currnet
implementation, wouldn't it?

shindere · 2023-09-11T15:46:32Z

As a sidenote: in aclocal.m4 we have tests like

AS_IF([test x"$hard_error" = "xtrue"], ...)

I tend to rewrite thoselike this:

AS_IF([test $hard_error], ...)

because this looks cleaner and is more concise. HOwever, the price topay for
that is that we fork to run bin/true or bin/false so the less concise
solution has the advantage of beingmore efficient.

If it was only me I woul'nt mind paying the price of a fork for
something easier to read but others may have different opinions, especially
when running on Windows.

xavierleroy · 2023-09-11T17:01:00Z

Not a proper review at this time, just some questions and remarks based on a quick reading.

Why return NaN if the FMA instruction is not available? That's very unhelpful for the numerical programmer. Just letting the program crash would be more informative already. Raising a proper OCaml exception would be much better.
Intercepting machine traps with SEH or VEH always makes me nervous. The alternative is to test for the availability of the FMA instruction using the __get_cpuid intrinsic. The result can be cached. I could show sample code later.
It's a sad fact of x86 compilers that intrinsics cannot be used unless the appropriate flags are passed to the C compiler, which then starts generating code specific to a family of x86 processors. So, putting intrinsics in a separate file with a special compilation rule, like this PR does, is probably unavoidable. On the other hand, I would name it something more general like intrinsics.c so that it can be used for other intrinsics of interest. (For example: conversions for 16-bit FP format...)

dra27 · 2023-09-17T13:22:48Z

Thanks for the feedback! Answering each:

As I said, I wasn't at all convinced by the VEH part (neither really the SEH MSVC version) - likewise the same problem with returning NaN. The function's declared noalloc, I don't think there's any safe trick for raising an exception, is there? That said, as long as we can have Float.has_fma , I don't see any problem with the runtime terminating if Float.fma is then called?
__get_cpuid doesn't work here - that's the VirtualBox bug referenced. Sadly, the whole awareness of this has come from actual users using VirtualBox, as opposed to me reading about it 🙂 In this case, __get_cpuid will indicate that we don't have fma when we the CPU actually does.
Agreed with having intrinsics.c instead.
Sure for the FMA_CFLAGS - it's not my instinct to do it, but it does put all the logic for working out platform-specific flags in one place, as opposed to the present mix of configure and build system.

The emulated version of `caml_fma` is retired - OCaml either assumes hardware support or an ISO C99-compliant software implementation (e.g. glibc).

caml_log1p wasn't guarded with CAML_INTERNALS in misc.h but this was almost certainly just an oversight (technically this is an API-breaking change, therefore).

Now that the C99 functions are required, use them directly instead of via one-line intermediaries. caml_fma remains, to allow for adding error trapping on Windows.

fma returns nan for pre-Haswell or pre-Piledriver CPUs.

amongonz · 2023-12-07T20:00:25Z

@dra27

However, we're talking about quite old CPUs here

Intel has been releasing new CPUs without AVX2 or AVX as late as 2021, under the Pentium Silver/Gold and Celeron brands. So it isn't just 2012 hardware, even if these are budget and/or low-power CPUs.

dra27 · 2023-12-11T14:38:41Z

@devvydeebug - those chips presumably support the FMA instruction, though? Is there any easy way to get a list of them?

The reference to "quite old CPUs here" specifically means CPUs without hardware FMA support, rather than CPUs without AVX2 (hence the rest of the sentence "affecting just Float.fma"). AVX2 only comes into the discussion because of the compiler option and the overlapping of their introduction in Haswell and the need to separate the compilation of the function from the rest of the runtime precisely to deal with non-AVX2 hardware.

amongonz · 2023-12-11T16:15:46Z

@dra27

those chips presumably support the FMA instruction, though?

I don't think so. These CPUs can't support the full FMA3 set because they lack AVX(1) as well, so they can't handle ymm operands. Maybe they support the xmm subset of it, but I doubt it considering there isn't a separate feature flag for it like there is for AVX-512 fma.

I guess I could actually test executing vfmadd132sd on one, just to be sure. I think I could borrow a PC with such a CPU for a few minutes.

Is there any easy way to get a list of them?

For some reason CPU manufacturers insist on not providing actual databases, just clunky web listings with too much javascript yet very bad search forms. But checking https://ark.intel.com I see that most Intel chips released between 2017 and 2021 under the Pentium Silver/Gold and Celeron brands lack AVX, with some exceptions. (FMA is not listed on its own.)

I'm not necessarily saying this shouldn't go ahead, just wanted to make it clear it affects recent low budget hardware as well. Maybe attempting a vfmadd132sd during initialization and falling back to C99's fma if it fails is worth considering.

amongonz · 2023-12-11T17:27:14Z

I can confirm that a Pentium Gold G6400 (launched Q2 2020) can't execute vfmadd132sd.

I just assembled a my_fma function as vfmadd132sd, ret and combined it with a C program that reads 3 doubles and calls it, compiled without AVX. The same executable works on my machine, but crashes on the Pentium.

dra27 · 2023-12-12T06:54:23Z

Thanks, @devvydeebug - that's very useful info. There being newer chips without vfmadd132sd definitely means we should at least have something like Float.has_fma. It is also making me wonder if we should leave the ability to switch to an emulated version, even if we don't carry the code ourselves anymore (looking at, say, openlibm's software implementation of fma, but I haven't actually checked how it performs on our tests compared with the caml_fma already in runtime/floats.c)

This was referenced Aug 31, 2023

Relax fma check in configure for Visual Studio #12518

Merged

Make --enable-imprecise-c99-float-ops the default on CPU that doesn't support FMA #12513

Open

gasche added runtime-system windows labels Oct 7, 2023

dra27 added 6 commits October 20, 2023 11:55

Require hardware support for fma on Windows

faf8f86

The emulated version of `caml_fma` is retired - OCaml either assumes hardware support or an ISO C99-compliant software implementation (e.g. glibc).

Remove emulated C99 float function support

a0cfc5f

Use log1p directly in memtrace.c

9a8ca57

caml_log1p wasn't guarded with CAML_INTERNALS in misc.h but this was almost certainly just an oversight (technically this is an API-breaking change, therefore).

Simplify runtime/floats.c further

a4f2841

Now that the C99 functions are required, use them directly instead of via one-line intermediaries. caml_fma remains, to allow for adding error trapping on Windows.

Use SEH to detect missing fma with MSVC

909a702

fma returns nan for pre-Haswell or pre-Piledriver CPUs.

Use VEH to catch missing fma for mingw and Cygwin

4807d08

dra27 force-pushed the hardware-fma branch from 84fa92e to 4807d08 Compare October 20, 2023 10:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove pre-C99 emulated float functions and require hardware FMA support on Windows #12519

Remove pre-C99 emulated float functions and require hardware FMA support on Windows #12519

dra27 commented Aug 31, 2023

shindere commented Sep 11, 2023

shindere commented Sep 11, 2023

xavierleroy commented Sep 11, 2023 •

edited

dra27 commented Sep 17, 2023

amongonz commented Dec 7, 2023

dra27 commented Dec 11, 2023

amongonz commented Dec 11, 2023 •

edited

amongonz commented Dec 11, 2023 •

edited

dra27 commented Dec 12, 2023

Remove pre-C99 emulated float functions and require hardware FMA support on Windows #12519

Are you sure you want to change the base?

Remove pre-C99 emulated float functions and require hardware FMA support on Windows #12519

Conversation

dra27 commented Aug 31, 2023

Require hardware FMA on Windows

Emulated C99 float operations; retiring --enable-imprecise-c99-float-ops

shindere commented Sep 11, 2023

shindere commented Sep 11, 2023

xavierleroy commented Sep 11, 2023 • edited

dra27 commented Sep 17, 2023

amongonz commented Dec 7, 2023

dra27 commented Dec 11, 2023

amongonz commented Dec 11, 2023 • edited

amongonz commented Dec 11, 2023 • edited

dra27 commented Dec 12, 2023

Emulated C99 float operations; retiring `--enable-imprecise-c99-float-ops`

xavierleroy commented Sep 11, 2023 •

edited

amongonz commented Dec 11, 2023 •

edited

amongonz commented Dec 11, 2023 •

edited