Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ephemeron segfault with missing write barrier #9391

Closed
stedolan opened this issue Mar 24, 2020 · 21 comments
Closed

Ephemeron segfault with missing write barrier #9391

stedolan opened this issue Mar 24, 2020 · 21 comments

Comments

@stedolan
Copy link
Contributor

Ephemerons don't use the normal write barrier, and because they're marked incrementally this can lead to the lost object problem, where an object becomes reachable only from an ephemeron that's already marked and never ends up getting marked.

It's difficult to trigger because it relies on the timing of incremental GC, but here's an example that hits the bug on trunk.

module E = Ephemeron.K1

let key = ref 1
let mk_ephe () =
  let e = E.create () in
  E.set_key e key;
  e

let eph = mk_ephe ()
let other = ref "beep"

let () =
  for stop = 100 to 2000 do
    E.set_data eph (ref "hello");
    Gc.major ();
    Gc.minor (); Gc.major_slice 1 |> ignore; Gc.major_slice 1 |> ignore;
    (* let _ = Array.init 1 (fun _ -> mk_ephe ()) in *)
    let e = mk_ephe () in
    (* let _ = Array.init 3 (fun _ -> mk_ephe ()) in *)
    for i = 1 to 2000 do
      Gc.major_slice 1 |> ignore;
      if i = stop then begin
          E.blit_data eph e;
          E.set_data eph other;
        end;
    done;
    let r = match E.get_data e with None -> assert false | Some r -> r in
    Printf.printf "[%03d] %s\n%!" stop !r;
    Gc.full_major ();
  done

I'm not sure what the fix is. It may be as simple as adding darkening in blit_data (and maybe set_data?), but I don't understand what the marking invariants are for ephemerons.

@rwmjones
Copy link
Contributor

We've seen crashes in Coq on Fedora Rawhide. The stack trace is shown at the bottom of the following email: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/EANYCMJWU5RQNCJ4Y5JMISGZ7A3XBUAM/

Could this be related to this bug?

@stedolan
Copy link
Contributor Author

It's hard to say. That stacktrace looks like it might be corrupted: I can't see any route by which caml_ephe_set_key can call caml_alloc_shr. How hard is it to reproduce?

@rwmjones
Copy link
Contributor

It's very hard to reproduce. I usually have to run the build over and over for days to see this even once. I also cannot see a way to get from caml_ephe_set_key to caml_alloc_shr, but the stack trace is unfortunately all we have.

@mshinwell
Copy link
Contributor

Do you have a core dump?

@rwmjones
Copy link
Contributor

Right now I have core dumps, but the original coqc binary from the build directory was deleted so they are not much use.

However I will now run the build in a loop until it crashes again, and will collect the binary, compiler and core dump and upload them for you. I must stress this process may take many hours (maybe even days).

@mshinwell
Copy link
Contributor

It would be worth preserving the environment on the build machine if possible, as some addresses won't resolve correctly having transported the executable elsewhere.

@rwmjones
Copy link
Contributor

rwmjones commented Apr 1, 2020

I killed this just now. It ran for 2 days without reproducing the bug. What changed is that our Coq package has moved to the final Coq / OCaml 4.10.0 patch. Both new and old patches touch the GC, so it's entirely possible that the old patch was simply to blame.

@bobot
Copy link
Contributor

bobot commented Apr 5, 2020

Weak pointers indeed are marked during read not during write because during mark scanning they don't directly mark the destination when the source is marked.

  • during mark phase, reading marks the data (x) with this criteria:
/** The minor heap is considered alive. */
#if defined (NATIVE_CODE) && defined (NO_NAKED_POINTERS)
/** The minor heap doesn't have to be marked, outside they should
    already be black
*/
Caml_inline int Must_be_Marked_during_mark(value x)
{
  CAMLassert (x != caml_ephe_none);
  CAMLassert (caml_gc_phase == Phase_mark);
  return Is_block (x) && !Is_young (x);
}
#else
Caml_inline int Must_be_Marked_during_mark(value x)
{
  CAMLassert (x != caml_ephe_none);
  CAMLassert (caml_gc_phase == Phase_mark);
  return Is_block (x) && Is_in_heap (x);
}
#endif
  • during clean phase, a data that is not marked is not returned.
#define CAMLassert_not_dead_value(v) do{        \
    CAMLassert ( caml_gc_phase != Phase_clean   \
                 || !Is_block(v)                \
                 || !Is_in_heap (v)             \
                 || !Is_white_val(v) );         \
}while(0)
  • At the end of the clean phase, every remaining data must be marked because of this assertion at the end of caml_ephe_clean which is called before set_data, blit_data and for all ephemerons:
        CAMLassert(caml_gc_phase == Phase_clean);
        ...
        /* If we scanned all the keys and the data field remains filled,
           then the mark phase must have marked it */
        CAMLassert( !(offset_start == 2 && offset_end == Wosize_hd (Hd_val(v))
                      && Is_block (child) && Is_in_heap (child)
                      && Is_white_val (child)));

Do you think this seems sound, and should not segfault (if the segfault is that a not marked value is returned by get_data)?

Now that "what should be" is explained, I will debug your test case.

@bobot
Copy link
Contributor

bobot commented Apr 5, 2020

I'm not able to obtain the segfault only 1901 hellos, with 1d76e2c and:

./configure
make
./ocamlopt.opt -nostdlib -I stdlib test.ml -o test
./test
./ocamlc.opt -nostdlib -I stdlib test.ml -o test
boot/ocamlrun ./test

Do you have the segfault with the variant with assertions?

EDIT: I don't have segfault also with -I runtime -runtime-variant d in native

@bobot
Copy link
Contributor

bobot commented Apr 5, 2020

But with 52dc5d7, I have one, and with the debug runtime_variant, this assertion is broken:

file caml/weak.h; line 208 ### Assertion failed: !(offset_start == 2 && offset_end == Wosize_hd (Hd_val(v)) && Is_block (child) && Is_in_heap (child) && Is_white_val (child))

@bobot
Copy link
Contributor

bobot commented Apr 5, 2020

A bug is that during mark phase, the blit move a white data from a not yet scanned ephemeron to a scanned ephemeron which keys are alive. It is particular to blit because blit tries not to darken things. I'm writing a fix.

@bobot
Copy link
Contributor

bobot commented Apr 5, 2020

The invariant during mark is described in major_gc.c, the most important is the set (1):

      - the ephemerons in (1) have a data alive or none
        (nb: new ephemerons are added in this part by weak.c)

@stedolan, the condition for the darkening is a little different from the normal write barrier, but you were completely right that it was missing in set_data and blit_data.

I still must check the condition for the set (2) during set_key, darkening of the data is perhaps needed, not for a segfault but for correction (not removing a data with alive keys). But it could go in another MR.

@stedolan
Copy link
Contributor Author

  • At the end of the clean phase, every remaining data must be marked because of this assertion at the end of caml_ephe_clean which is called before set_data, blit_data and for all ephemerons:
        CAMLassert(caml_gc_phase == Phase_clean);
        ...
        /* If we scanned all the keys and the data field remains filled,
           then the mark phase must have marked it */
        CAMLassert( !(offset_start == 2 && offset_end == Wosize_hd (Hd_val(v))
                      && Is_block (child) && Is_in_heap (child)
                      && Is_white_val (child)));

Do you think this seems sound, and should not segfault (if the segfault is that a not marked value is returned by get_data)?

I agree that that assertion should hold at the end of marking, but I don't see what makes it hold.

What I'd like to see is some description of the marking invariant for ephemerons. Some examples of these:

Tri-colour invariant (not what OCaml uses): There are no pointers from white objects to black objects. To enforce this, the write barrier checks to see if it is creating such a pointer and makes one of the objects involved grey.

Weak tri-colour invariant (what OCaml uses): All white objects pointed to by black objects are also reachable from gray objects along white paths. To enforce this, the write barrier ensures that it never makes a white object unreachable, by making white objects grey when references to them are overwritten.

Both of these invariants ensure correctness of marking, because if they are true when there are no gray objects left then all reachable objects are black.

I'd like to understand a similar invariant about ephemerons, but at the moment I don't. It's certainly not one of the two standard invariants above. Is there an invariant for ephemerons that's true throughout marking and implies your assertion when marking completes?

@bobot
Copy link
Contributor

bobot commented Apr 19, 2020

In major_gc.c, this invariant is given:

/**
   Ephemerons:
   During mark phase the list caml_ephe_list_head of ephemerons
   is iterated by different pointers that follow the invariants:
   caml_ephe_list_head ->* ephes_checked_if_pure ->* ephes_to_check ->* null
                        |                         |                  |
                       (1)                       (2)                (3)

    At the start of mark phase, (1) and (2) are empty.

    In mark phase:
      - the ephemerons in (1) have a data alive or none
        (nb: new ephemerons are added in this part by weak.c)
      - the ephemeron in (2):
         - is in any state if ephe_list_pure is false
         - otherwise have at least a white key or is white or its data is black or none.
           The third case can happen only using a set_* of weak.c
      - the ephemerons in (3) are in an unknown state and must be checked

    At the end of mark phase, (3) is empty and ephe_list_pure is true.
    The ephemeron in (1) and (2) will be cleaned (white keys and data
    replaced by none or the ephemeron is removed from the list if it is white)
    in clean phase.

(For (2) it is not the version in master but the version of the branch with the fix but I think it is better to talk on it). And I think it implies my assertions. The functions from weak where breaking the invariant.

@stedolan
Copy link
Contributor Author

Thanks, that helps, but I'm afraid I still have a bunch of questions!

In case (1), what does a data alive or none mean? Specifically, does "alive" mean white, black, gray or unknown? What happens if the data changes between alive and none?

During marking, are reachable ephemerons in state (1) already marked? How about state (2), assuming ephe_list_pure is true?

What exactly does ephe_list_pure mean?

It looks like ephemerons do not have the snapshot-at-the-beginning property that other values in OCaml have. That is, if an ephemeron and its key are both reachable at the start of marking (and so the ephemeron will not be cleaned and its data is reachable), it looks like it is possible for the data to be collected this cycle if unset_data is called before the ephemeron is marked. Is this true?

@bobot
Copy link
Contributor

bobot commented Apr 23, 2020

In case (1), what does a data alive or none mean? Specifically, does "alive" mean white, black, gray or unknown?

alive means:

  • outside the major_heap
  • otherwise, black or gray

I'm noticing that alive is not very clear , do you have a better word/notion? It is alive in the sense it will not be swept.

What happens if the data changes between alive and none?

Nothing. The invariant still old? Sorry, I'm not seeing the finality of the question.

During marking, are reachable ephemerons in state (1) already marked?

Interesting question, it should be written. I would say the ephemeron themselves could be not marked, it is not a problem. But if the information is needed: I have a doubt for ephemerons without data but ephemerons with data go in (1) only if marked.

How about state (2), assuming ephe_list_pure is true?

They can be marked or not. If they are marked but a key is not marked and they have a data they go to (2).

What exactly does ephe_list_pure mean?

It means that we don't have to look again in the set (2) to know if one ephemeron is alive and all is keys are alive, so we need to mark the data. ephe_list_pure is reset to false when a value is marked.

It looks like ephemerons do not have the snapshot-at-the-beginning property that other values in OCaml have.

I believe it is true. The datas, and all the values reachable from them, don't have this property. But it is not a problem because we have a read barrier for keys and datas, which mark the values.

@stedolan
Copy link
Contributor Author

Thanks for the explanations!

In case (1), what does a data alive or none mean? Specifically, does "alive" mean white, black, gray or unknown?

alive means:

  • outside the major_heap
  • otherwise, black or gray

I'm noticing that alive is not very clear , do you have a better word/notion? It is alive in the sense it will not be swept.

Thanks, I think I understand this now. Still not sure of the wording, though.

The characterisation of "will not be swept" makes sense, but I don't think this exactly matches the description above. I think it's possible to have an ephemeron in (1) with a white data by calling set_data with a white block. In order to call set_data with a white block, however, it needs to be a reachable block that will be marked before Phase_mark completes, so it won't be swept.

What happens if the data changes between alive and none?

Nothing. The invariant still old? Sorry, I'm not seeing the finality of the question.

During marking, are reachable ephemerons in state (1) already marked?

Interesting question, it should be written. I would say the ephemeron themselves could be not marked, it is not a problem. But if the information is needed: I have a doubt for ephemerons without data but ephemerons with data go in (1) only if marked.

I'd expect the answer to be the same regardless of whether the ephemeron has data or not, because if I can get an unmarked ephemeron into (1) without data, then I can call set_data afterwards and have an unmarked ephemeron in (1) with data.

How about state (2), assuming ephe_list_pure is true?

They can be marked or not. If they are marked but a key is not marked and they have a data they go to (2).

What exactly does ephe_list_pure mean?

It means that we don't have to look again in the set (2) to know if one ephemeron is alive and all is keys are alive, so we need to mark the data. ephe_list_pure is reset to false when a value is marked.

OK. It sounds like we should also change ephe_list_pure to false when a key changes from white to black via set_key? (It looks like there might be a fix along these lines in #9424?)

@bobot
Copy link
Contributor

bobot commented May 14, 2020

I think it's possible to have an ephemeron in (1) with a white data by calling set_data with a white block. In order to call set_data with a white block, however, it needs to be a reachable block that will be marked before Phase_mark completes, so it won't be swept

In #9424 it is now marked during set_data. Do you think it is possible not to mark it in that case, because it will always be marked before Phase_mark completes? So only set_key, blit_key, and blit_data are problematic.

OK. It sounds like we should also change ephe_list_pure to false when a key changes from white to black via set_key? (It looks like there might be a fix along these lines in #9424?)

Each time a value goes from white to grey/black ephe_list_pure is set to false, whatever the reason. #9424 just mark more, it doesn't change this logic.

@stedolan
Copy link
Contributor Author

Each time a value goes from white to grey/black ephe_list_pure is set to false, whatever the reason. #9424 just mark more, it doesn't change this logic.

It's possible for an ephemeron key to change from white to black without any block changing from white to black.

Suppose we have an ephemeron E, which is itself marked but has a white key K and a white value V. E is in set (2) and ephe_list_pure is true.

We take another key K', which is already black, and we use set_key to modify the key of E. Now E has a black key, a white value, is in set (2), and ephe_list_pure is still true. The invariant of (2) can be broken without the colour of any block changing.

(This might already be fixed in 9424?)

@bobot
Copy link
Contributor

bobot commented May 20, 2020

(This might already be fixed in 9424?)

Yes I believe it is. So I agree with you the current state is utterly broken. I'm talking about #9424 for the invariant. We can talk there for clarity if you prefer.

Is there a time frame for the fix? I don't have much time currently, but fixing bugs in ephemerons is still a priority for me.

bobot added a commit to bobot/ocaml that referenced this issue Oct 24, 2020
bobot added a commit to bobot/ocaml that referenced this issue Oct 24, 2020
bobot added a commit to bobot/ocaml that referenced this issue Oct 25, 2020
@github-actions
Copy link

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

@github-actions github-actions bot added the Stale label May 24, 2021
@stedolan stedolan removed the Stale label May 24, 2021
bobot added a commit to bobot/ocaml that referenced this issue Jul 14, 2021
       There is a tradeoff between marking data that could be avoiding and
       inspecting too much keys. The choice here is to keep the computation
       linear in the argument (we only look at the keys that are blitted or set).

       rename ephe_list_pure into caml_ephe_list_pure

       Fixes ocaml#9391
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants