Performance regression in OCaml 4.08 #9326
[By the way, do you folks know where the OCaml performance tests are stored?]
There are no performance tests in the compiler distribution (some parts of the testsuite, typically …)
Thanks @gasche, I was wondering if some of the people measuring performance had found the same regression as us. I was not able to get a meaningful result from the OCamlPro site; I knew about the OCaml Labs site, but it seems limited to 4.06+multicore, right?
Regarding where regressions may be: many items in the changelog could lead to performance differences, and it's fairly hard to tell. One first thing to look at may be that 4.08 is the first release to use an autoconf-based configuration (before that, OCaml had a manually-written configure script). Because the various feature-detection logic has been rewritten, it is possible that a stock 4.08 compiler gets configured differently from a stock 4.07 compiler in a way that impacts performance. @ejgallego I don't remember reports of similar regressions from industrial OCaml users (who typically do run performance tests on their own programs when they switch to a new OCaml version).
We can reproduce with a standard …
In fact I was wondering about this precise point: how come nobody else has noticed yet? I guess it could be something very specific to Coq; or maybe some compute-intensive users are still on 4.07?
Also, I remember seeing quite degraded performance with 4.08+flambda some time back; I didn't investigate much, as I had no time back then.
I believe that at least LexiFi and Jane Street have migrated to 4.08 and then 4.09. One candidate for a performance regression for Coq could be #1895: it fixes the capture of callstacks/backtraces in new threads, which would previously only capture the last stack frame. If Coq captures callstacks/backtraces and does computations in threads, it could be hit by a performance regression coming from the callstacks suddenly being much longer. There are many other PRs that could introduce performance regressions if we are unlucky. For example, #2060 and #2070 might result in deoptimization if there is a mistake in them. #2083 is a bugfix that disables some unsafe unboxing in the presence of GADTs; if a single hot function from Coq is affected, it could give a visible performance degradation. Seeing a performance profile of a Coq run under both versions might help in narrowing the issue down?
Thanks a lot for the pointers @gasche; indeed, the preliminary plan could look like: …
IMHO it is still hard to tell what the cause could be; all of the options you propose are still on the table for now.
Can you confirm that the numbers given are for x86_64 machines, in particular not 32-bit machines? (For example, #1683 affects marshalling of Custom values, and I think Coq uses Custom blocks to represent the VM's uint63 on 32-bit machines.)
For now x86_64; also, the slowdowns don't seem to be tied to developments making heavy use of the VM.
We suspect multiple sources of slowdowns around the 4.08 release, but unfortunately they haven't yet been tracked down precisely. Have you seen #8776?
Thanks @nojb; note that in this case we are not talking about the compiler itself: the code produced by the compiler is in some cases 30% slower.
Of course, if the compiler itself were affected by the same bug as Coq, that would explain why there is an overall slowdown. I wonder if systematic benchmarks should be introduced at the PR level; in Coq we are limited by the amount of available hardware (so we only test selected PRs), but the plan is to do so.
Well, PR-level performance monitoring is of course a great idea, but it is also a lot of work, and so far no one has been willing to do this consistently for the OCaml trunk. (#8776 is a compile-time regression due to a change in type-checking behavior; it should not influence programs that do not call the type-checker at runtime.)
Is the slowdown observed with the flambda optimizer or without? Without flambda, and using KB and a few other old tests as benchmarks, I'm not seeing significant performance differences between 4.07.1, 4.08.1, and 4.09.0. Will try flambda next.
I believe that Emilio's numbers come from the standard ocaml-base-compiler-{4.07.1,4.08.1} switches, with default configuration options, so in particular without flambda. I reviewed the opam files to check if there were configuration differences, and did not find much. (4.07.1 uses …)
@ejgallego: your 4.08/4.09 benchmark failed because it appears to use an invalid switch name: there is no 4.09.1 release yet.
In case that's useful, this is the file with the biggest slowdown, at 28%. There are some more that are almost as bad, but they have even longer dependency chains.
The numbers on the Coq bench are without flambda. I am not sure what @RalfJung is using in their CI, but given that he spoke about the "Debian" version, that would make it flambda. [For Iris we use flambda in Coq's CI, as the improvement is noticeable.] I do remember seeing a slowdown in 4.08.0+flambda too; I will run some benchmarks with flambda once the current one finishes.
@gasche indeed, I got confused by opam having a 4.09.1 ocaml package; link updated. I've also added a 4.07.1 vs 4.09.0 bench.
We are using the … So, no flambda.
Note: Emilio's benchmarks are configured so that "NEW" is the older OCaml compiler and "OLD" is the newer OCaml compiler. So -7% on the first benchmark (4.07.1 vs 4.08.1) means that CompCert builds 7% faster on 4.07.1 than on 4.08.1, and +2% on the second benchmark (4.08.1 vs 4.09.0) indicates that CompCert builds 2% slower on 4.08.1 than on 4.09.0.
Sorry for that, it was not intended; I got confused in the UI.
Benchmarks updated; compiling ssreflect takes the biggest hit [around 10% on average], so we will try to work out a reduced test case from that.
@xavierleroy for now the easiest way to reproduce is to compile ssreflect; we are working on isolating a reduced case from that.
Numbers from the 4.07.1 vs 4.10.0 build:
Note that these numbers are better than just 4.07.1 vs 4.08.1; there is also a big improvement in a few developments in 4.10 that is not in 4.08.1, so IMO that's orthogonal to the regression in question here.
Are some entries missing from that table? lambdarust/iris should usually be in there.
There is this line in the 4.08.1 vs 4.07.1 results:
This suggests that 4.07.1 is 3.16% faster than 4.08.1 at checking coq-lambda-rust. I have looked at the numbers in general, and I think that they point to a clear regression between 4.07.1 and 4.08.1 (with some independent improvements in later versions). There is not much more that I could tell from these numbers, and I stopped looking at them. To go further I would personally need one of: …
I would add to Gabriel's list: …
@lpw25 usually the noise range on our bench is around 1%.
I was able to reproduce the slowdown by running Coq on CompCert files. It looks like the major GC is working harder in 4.08.1 than in 4.07.1, in the context of Coq at least. @damiendoligez is looking into it. |
Dear all, some new results w.r.t. 4.11:
Even if best_fit seems to improve some cases, some users still seem to be hit (4.07.1 vs 4.11.1 + best_fit):
As far as I know, we still have no idea which change precisely in 4.08+dev introduced the regression. What should we do to work on this? In #9326 (comment) I mentioned a handful of PRs that might be involved in the regression. Maybe I could try to provide experimental variants of 4.08 with those PRs reverted, and then you (@ejgallego) would run the benchmarks (or just a couple of benchmarks that exhibit the regression) on them?
I'm afraid I don't have the cycles to run the bisect right now, sorry. Something that would make this much easier would be the ability to compose the compiler build with OCaml libraries, as we support for Coq when using dune.
That would help, thanks; however, our bench system has changed and it is much harder now to try new switches / package subsets :S
I dug into this a bit, and I think I have an explanation. The cause is a bug in the runtime's implementation of `Gc.set` (a missing `Long_val` conversion; see the fix below). I could reproduce the issue by comparing builds of mathcomp-1.12 using Coq 8.12.2 on OCaml 4.07.1 and 4.08.1. Many of the coqc invocations are ~10% slower on 4.08.1. It's not a code-layout issue (4.08.1 is actually executing ~10% more instructions), and the extra time is spent in the GC. I arbitrarily picked one file (…)
Coq uses non-default GC settings for speed, corresponding to an `OCAMLRUNPARAM` setting of …. Weirdly, the performance difference goes away when the environment variable is set:
The first two rows should be identical: the second row is explicitly setting what the first row uses as a default. The issue is that Coq does not set these parameters through the environment variable but from code, via `Gc.set`.
This means that every call to `Gc.set` stored wrong (inflated) values for the `custom_` parameters, making the major GC work much harder than intended.
(I arrived just in time to reserve the last cool smiley for myself.) Thanks for the fascinating find! It is surprising that other users have not caught the issue before. It sounds like explicit GC settings are not that common, or people don't do that much differential performance testing across compiler versions. There should be a small-but-noticeable performance regression in all programs that use explicit Gc settings (in the code) and also custom blocks. The bug is very easy to fix. If I understand correctly, fixing it may improve the performance of Coq (and other programs) on all >=4.08 versions.
Great detective work! @damiendoligez and I did observe that the major GC was working harder in 4.08 than in 4.07 on those runs of Coq, but the root cause was unknown. In retrospect, it also explains why some runs of Coq are affected more than others: some runs read more files than others. |
A `Long_val` conversion was missing. This was setting the wrong values for the `custom_` parameters, causing the major GC to work too much, slowing down some programs. Fixes: ocaml#9326
Obvious fix + regression test in #10125.
Some `Long_val` conversions were missing. This was setting the wrong values for the `custom_` parameters, causing the major GC to work too much, slowing down some programs. Add regression test. Fixes: ocaml#9326
For other archeologists like me:
The link now points to the wrong code; the correct permalink is "non-default GC settings for speed", and the current permalinked location as of the writing of this comment is here.
Dear OCaml devs,
@RalfJung has pointed out a non-trivial performance regression in Coq when moving from OCaml 4.07 to OCaml 4.08: coq/coq#11652
See some numbers here; for example, CompCert compiles more than 7% faster on 4.07.
There are a few candidates in the changelog for the slowdown; I think we will try reverting some PRs and seeing what happens with a reduced file. Any ideas? [For example, #2144 could be relevant? I doubt it, though.]
Benchmarks: