New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ephemeron segfault with missing write barrier #9391
Comments
We've seen crashes in Coq on Fedora Rawhide. The stack trace is shown at the bottom of the following email: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/EANYCMJWU5RQNCJ4Y5JMISGZ7A3XBUAM/ Could this be related to this bug? |
It's hard to say. That stacktrace looks like it might be corrupted: I can't see any route by which |
It's very hard to reproduce. I usually have to run the build over and over for days to see this even once. I also cannot see a way to get from |
Do you have a core dump? |
Right now I have core dumps, but the original coqc binary from the build directory was deleted so they are not much use. However I will now run the build in a loop until it crashes again, and will collect the binary, compiler and core dump and upload them for you. I must stress this process may take many hours (maybe even days). |
It would be worth preserving the environment on the build machine if possible, as some addresses won't resolve correctly having transported the executable elsewhere. |
I killed this just now. It ran for 2 days without reproducing the bug. What changed is that our Coq package has moved to the final Coq / OCaml 4.10.0 patch. Both new and old patches touch the GC, so it's entirely possible that the old patch was simply to blame. |
Weak pointers indeed are marked during read not during write because during mark scanning they don't directly mark the destination when the source is marked.
Do you think this seems sound, and should not segfault (if the segfault is that a not marked value is returned by Now that "what should be" is explained, I will debug your test case. |
I'm not able to obtain the segfault only 1901 hellos, with 1d76e2c and:
Do you have the segfault with the variant with assertions? EDIT: I don't have segfault also with |
But with 52dc5d7, I have one, and with the debug runtime_variant, this assertion is broken:
|
A bug is that during mark phase, the blit move a white data from a not yet scanned ephemeron to a scanned ephemeron which keys are alive. |
The invariant during mark is described in
@stedolan, the condition for the darkening is a little different from the normal write barrier, but you were completely right that it was missing in I still must check the condition for the set |
I agree that that assertion should hold at the end of marking, but I don't see what makes it hold. What I'd like to see is some description of the marking invariant for ephemerons. Some examples of these: Tri-colour invariant (not what OCaml uses): There are no pointers from white objects to black objects. To enforce this, the write barrier checks to see if it is creating such a pointer and makes one of the objects involved grey. Weak tri-colour invariant (what OCaml uses): All white objects pointed to by black objects are also reachable from gray objects along white paths. To enforce this, the write barrier ensures that it never makes a white object unreachable, by making white objects grey when references to them are overwritten. Both of these invariants ensure correctness of marking, because if they are true when there are no gray objects left then all reachable objects are black. I'd like to understand a similar invariant about ephemerons, but at the moment I don't. It's certainly not one of the two standard invariants above. Is there an invariant for ephemerons that's true throughout marking and implies your assertion when marking completes? |
In
(For (2) it is not the version in master but the version of the branch with the fix but I think it is better to talk on it). And I think it implies my assertions. The functions from weak where breaking the invariant. |
Thanks, that helps, but I'm afraid I still have a bunch of questions! In case (1), what does During marking, are reachable ephemerons in state (1) already marked? How about state (2), assuming What exactly does It looks like ephemerons do not have the snapshot-at-the-beginning property that other values in OCaml have. That is, if an ephemeron and its key are both reachable at the start of marking (and so the ephemeron will not be cleaned and its data is reachable), it looks like it is possible for the data to be collected this cycle if |
I'm noticing that
Nothing. The invariant still old? Sorry, I'm not seeing the finality of the question.
Interesting question, it should be written. I would say the ephemeron themselves could be not marked, it is not a problem. But if the information is needed: I have a doubt for ephemerons without data but ephemerons with data go in (1) only if marked.
They can be marked or not. If they are marked but a key is not marked and they have a data they go to (2).
It means that we don't have to look again in the set (2) to know if one ephemeron is alive and all is keys are alive, so we need to mark the data.
I believe it is true. The datas, and all the values reachable from them, don't have this property. But it is not a problem because we have a read barrier for keys and datas, which mark the values. |
Thanks for the explanations!
Thanks, I think I understand this now. Still not sure of the wording, though. The characterisation of "will not be swept" makes sense, but I don't think this exactly matches the description above. I think it's possible to have an ephemeron in (1) with a white data by calling
I'd expect the answer to be the same regardless of whether the ephemeron has data or not, because if I can get an unmarked ephemeron into (1) without data, then I can call
OK. It sounds like we should also change |
In #9424 it is now marked during
Each time a value goes from white to grey/black |
It's possible for an ephemeron key to change from white to black without any block changing from white to black. Suppose we have an ephemeron E, which is itself marked but has a white key K and a white value V. E is in set (2) and ephe_list_pure is true. We take another key K', which is already black, and we use set_key to modify the key of E. Now E has a black key, a white value, is in set (2), and ephe_list_pure is still true. The invariant of (2) can be broken without the colour of any block changing. (This might already be fixed in 9424?) |
Yes I believe it is. So I agree with you the current state is utterly broken. I'm talking about #9424 for the invariant. We can talk there for clarity if you prefer. Is there a time frame for the fix? I don't have much time currently, but fixing bugs in ephemerons is still a priority for me. |
This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc. |
There is a tradeoff between marking data that could be avoiding and inspecting too much keys. The choice here is to keep the computation linear in the argument (we only look at the keys that are blitted or set). rename ephe_list_pure into caml_ephe_list_pure Fixes ocaml#9391
Ephemerons don't use the normal write barrier, and because they're marked incrementally this can lead to the lost object problem, where an object becomes reachable only from an ephemeron that's already marked and never ends up getting marked.
It's difficult to trigger because it relies on the timing of incremental GC, but here's an example that hits the bug on trunk.
I'm not sure what the fix is. It may be as simple as adding darkening in
blit_data
(and maybeset_data
?), but I don't understand what the marking invariants are for ephemerons.The text was updated successfully, but these errors were encountered: