Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove pre-C99 emulated float functions and require hardware FMA support on Windows #12519

Open
wants to merge 6 commits into
base: trunk
Choose a base branch
from

Conversation

dra27
Copy link
Member

@dra27 dra27 commented Aug 31, 2023

This is a proposal for #12513 for trunk. There are two intertwined facets to this PR, which mostly deletes (or moves) code.

Require hardware FMA on Windows

Cygwin has an intentionally "broken" implementation of fma (it is simply return x * y + z!). mingw-w64 is known to be broken as, unsurprisingly, is Visual Studio. Since Visual Studio 2019, where hardware support is available, hardware support is always used when available (in previous versions, the optimiser was capable of emitting FMA instructions, but never for the actual C99 function itself).

The hardware support in Visual Studio interacts badly with VirtualBox bug #15471, which for reasons known only to Oracle, masks out the FMA bit from the cpuid. However, it is only the masking which is incorrect - code using FMA instructions correctly executes. In terms of cpuid, it is just about possible to conflate AVX2 (which is reported in VirtualBox) with FMA, because they were released in the same architecture (indeed, the cl flag for forcing FMA is -arch:AVX2), but while no Intel or AMD chips exist which support AVX2 but not FMA, there are VIA chips which have AVX2 but not FMA.

However, we're talking about quite old CPUs here (pre-2013), affecting just Float.fma. Certified Windows hardware now requires Broadwell (2015) CPUs for both Windows 10 and Windows 11 (NB that's certification - Windows will physically run on anything from 2008's Nehalem). This PR therefore proposes a big simplification for Windows, by always synthesising caml_fma using vfmadd132sd. Since the cl option to force this also enables AVX2, it's necessary to put caml_fma in a C file of its own, or the entire runtime becomes unusable on pre-Haswell CPUs.

There then remains the question of what happens if you try to use Float.fma on Windows on an old CPU. @kit-ty-kate has been using an old laptop for Windows testing, and I took a trip to my loft and fished out an old 2012 laptop with an Intel i7-2760QM (AVX, no AVX2 or FMA, therefore) to have a look. If Float.fma is called, the program will by default abort with EXCEPTION_ILLEGAL_INSTRUCTION (0xc000001d). Cygwin should translate that to a more Unix-ly familiar SIGILL.

The VirtualBox bug means there's not much value in us doing anything with the cpuid (@jonahbeckford suggests this is one of the most common bug reports installing Windows OCaml for Diskuv).

I went down a little rabbit-hole to see if we could do something better in that situation. For the MSVC port, it's pretty straightforward to catch asynchronous exceptions via Structured Exception Handling (SEH) (using cl's __try and __except extensions), although entertainingly there seems to be a mis-compilation bug in Visual Studio 2019 (the code compiles correctly in Visual Studio 2022).

GCC doesn't implement the Microsoft extensions for SEH, but it's possible to use Vectored Exception Handling (VEH) instead. It's not pretty, though - it's a very similar trick to stack overflow detection in Windows in OCaml 4.x.

Since Float.fma is declared noalloc, we can't raise an exception so instead the function returns Float.nan on all inputs when the support is missing.

I've left the two commits in place for now, but I'm not convinced - especially as it's tricky to test. A possibility might be to include a helper function (similar to Unix.has_symlink) which could do the test much less invasively (that would also be much easier to test, because such a function could be easily tested with __ud2() so that the logic could be verified in CI).

Emulated C99 float operations; retiring --enable-imprecise-c99-float-ops

#944 added various C99 float operations. To build with old (but supported) versions of Visual Studio, configure has a --enable-imprecise-c99-float-ops option which turns on various emulations (this emulation is automatically enabled for Windows, but can be manually enabled on any platform). When the MSVC port of OCaml returns to trunk, it will necessarily require at least Visual Studio 2022, which has all of these functions. This PR therefore removes the emulations.

The mingw-w64 port has a long-known bug in its implementation of round. The work-around for this remains, but the detection of it is only used for mingw-w64 (pedantically, configuring with --disable-imprecise-c99-float-ops continues to cause a configuration error for mingw-w64, therefore). The enormous emulated version of caml_fma is gone, though.

The unboxed versions are now truly unboxed, which results in a visible renaming of the primitives - i.e. caml_cbrt is removed and Float now directly calls cbrt.

#8684 added caml_log1p to caml/misc.h outside of CAML_INTERNALS. This function no longer exists, so is removed - technically, that's a breaking API change, but I think that the addition of caml_log1p without the CAML_INTERNALS guard was a minor oversight in a much more complicated change!

@shindere
Copy link
Contributor

Nothinglooks wrong to me but I am not an expert so this should be reviewed
also by somebody else.

For my part, the only remark I have is about the Makefile change done in the
first commit. How about introducing a configured FMA_CFLAGS build
varyable? I feel it would be more principled than the currnet
implementation, wouldn't it?

@shindere
Copy link
Contributor

As a sidenote: in aclocal.m4 we have tests like

AS_IF([test x"$hard_error" = "xtrue"], ...)

I tend to rewrite thoselike this:

AS_IF([test $hard_error], ...)

because this looks cleaner and is more concise. HOwever, the price topay for
that is that we fork to run bin/true or bin/false so the less concise
solution has the advantage of beingmore efficient.

If it was only me I woul'nt mind paying the price of a fork for
something easier to read but others may have different opinions, especially
when running on Windows.

@xavierleroy
Copy link
Contributor

xavierleroy commented Sep 11, 2023

Not a proper review at this time, just some questions and remarks based on a quick reading.

  • Why return NaN if the FMA instruction is not available? That's very unhelpful for the numerical programmer. Just letting the program crash would be more informative already. Raising a proper OCaml exception would be much better.
  • Intercepting machine traps with SEH or VEH always makes me nervous. The alternative is to test for the availability of the FMA instruction using the __get_cpuid intrinsic. The result can be cached. I could show sample code later.
  • It's a sad fact of x86 compilers that intrinsics cannot be used unless the appropriate flags are passed to the C compiler, which then starts generating code specific to a family of x86 processors. So, putting intrinsics in a separate file with a special compilation rule, like this PR does, is probably unavoidable. On the other hand, I would name it something more general like intrinsics.c so that it can be used for other intrinsics of interest. (For example: conversions for 16-bit FP format...)

@dra27
Copy link
Member Author

dra27 commented Sep 17, 2023

Thanks for the feedback! Answering each:

  • As I said, I wasn't at all convinced by the VEH part (neither really the SEH MSVC version) - likewise the same problem with returning NaN. The function's declared noalloc, I don't think there's any safe trick for raising an exception, is there? That said, as long as we can have Float.has_fma , I don't see any problem with the runtime terminating if Float.fma is then called?
  • __get_cpuid doesn't work here - that's the VirtualBox bug referenced. Sadly, the whole awareness of this has come from actual users using VirtualBox, as opposed to me reading about it 🙂 In this case, __get_cpuid will indicate that we don't have fma when we the CPU actually does.
  • Agreed with having intrinsics.c instead.
  • Sure for the FMA_CFLAGS - it's not my instinct to do it, but it does put all the logic for working out platform-specific flags in one place, as opposed to the present mix of configure and build system.

The emulated version of `caml_fma` is retired - OCaml either assumes
hardware support or an ISO C99-compliant software implementation
(e.g. glibc).
caml_log1p wasn't guarded with CAML_INTERNALS in misc.h but this was
almost certainly just an oversight (technically this is an API-breaking
change, therefore).
Now that the C99 functions are required, use them directly instead of
via one-line intermediaries. caml_fma remains, to allow for adding error
trapping on Windows.
fma returns nan for pre-Haswell or pre-Piledriver CPUs.
@amongonz
Copy link

amongonz commented Dec 7, 2023

@dra27

However, we're talking about quite old CPUs here

Intel has been releasing new CPUs without AVX2 or AVX as late as 2021, under the Pentium Silver/Gold and Celeron brands. So it isn't just 2012 hardware, even if these are budget and/or low-power CPUs.

@dra27
Copy link
Member Author

dra27 commented Dec 11, 2023

@devvydeebug - those chips presumably support the FMA instruction, though? Is there any easy way to get a list of them?

The reference to "quite old CPUs here" specifically means CPUs without hardware FMA support, rather than CPUs without AVX2 (hence the rest of the sentence "affecting just Float.fma"). AVX2 only comes into the discussion because of the compiler option and the overlapping of their introduction in Haswell and the need to separate the compilation of the function from the rest of the runtime precisely to deal with non-AVX2 hardware.

@amongonz
Copy link

amongonz commented Dec 11, 2023

@dra27

those chips presumably support the FMA instruction, though?

I don't think so. These CPUs can't support the full FMA3 set because they lack AVX(1) as well, so they can't handle ymm operands. Maybe they support the xmm subset of it, but I doubt it considering there isn't a separate feature flag for it like there is for AVX-512 fma.

I guess I could actually test executing vfmadd132sd on one, just to be sure. I think I could borrow a PC with such a CPU for a few minutes.

Is there any easy way to get a list of them?

For some reason CPU manufacturers insist on not providing actual databases, just clunky web listings with too much javascript yet very bad search forms. But checking https://ark.intel.com I see that most Intel chips released between 2017 and 2021 under the Pentium Silver/Gold and Celeron brands lack AVX, with some exceptions. (FMA is not listed on its own.)


I'm not necessarily saying this shouldn't go ahead, just wanted to make it clear it affects recent low budget hardware as well. Maybe attempting a vfmadd132sd during initialization and falling back to C99's fma if it fails is worth considering.

@amongonz
Copy link

amongonz commented Dec 11, 2023

I can confirm that a Pentium Gold G6400 (launched Q2 2020) can't execute vfmadd132sd.

I just assembled a my_fma function as vfmadd132sd, ret and combined it with a C program that reads 3 doubles and calls it, compiled without AVX. The same executable works on my machine, but crashes on the Pentium.

@dra27
Copy link
Member Author

dra27 commented Dec 12, 2023

Thanks, @devvydeebug - that's very useful info. There being newer chips without vfmadd132sd definitely means we should at least have something like Float.has_fma. It is also making me wonder if we should leave the ability to switch to an emulated version, even if we don't carry the code ourselves anymore (looking at, say, openlibm's software implementation of fma, but I haven't actually checked how it performs on our tests compared with the caml_fma already in runtime/floats.c)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants