Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplementation of Random using an LXM pseudo-random number generator #10742

Merged
merged 6 commits into from Jan 17, 2022

Conversation

xavierleroy
Copy link
Contributor

@xavierleroy xavierleroy commented Nov 1, 2021

[ I am not proposing this PR for inclusion in OCaml 4.14. It is intended for Multicore OCaml and OCaml 5.00. ]

[ Now that Multicore OCaml is merged and the domain-local-storage interface is what we need, it's time to change PRNG ! ]

Following extensive experimentation (https://github.com/xavierleroy/pringo/), discussions (ocaml/RFCs#28), and a previous attempt based on a Xoshiro generator (#10701), this PR reimplements the Random standard library module using a pseudo-random generator from the LXM family, as described by Steele and Vigna in their OOPSLA 2021 paper. I used the L64X128 variant, shown in figure 1 of the paper.

The main benefit of the new PRNG is that it supports "splitting", i.e. deterministically drawing a new PRNG state from a given PRNG state, getting two statistically-independent PRNGs that can be splitted again later. (The current Random implementation does not support this.) Splitting will come very handy for Multicore OCaml, enabling each domain to maintain its own PRNG state in domain-local storage, with splitting being used to give a fresh PRNG state when a new domain is created. Splitting has other uses for Quickcheck-style generation of random functions and streams.

The L64X128 generator has a large internal state (192 bits + 64 bits for diversification at splits), making reseeding unnecessary. Steele and Vigna's extensive statistical analysis shows no known weaknesses.

Compared with the current implementation of Random, we get nice performance improvements on 64-bit platforms and not so nice performance degradation on 32-bit platforms.

(Times are in seconds for 50000000 repetitions, unless indicated.)

x86 64 bits

  4.13  LXM

  0.30  0.21  bit
  0.28  0.20  bits
  0.42  0.31  int 0xFFEE
  0.71  0.40  int32 0xFFEEDD
  0.97  0.32  int64 0xFFEEDDCCAA
  0.72  0.31  float 1.0
 11.35  0.02  full_init 3 (/100)

ARMv7 32 bits

  4.13  LXM

  1.75  4.84  bit
  1.53  2.78  bits
  4.63  6.16  int 0xFFEE
  5.12  6.89  int32 0xFFEEDD
 30.92 20.51  int64 0xFFEEDDCCAA
  3.36  8.30  float 1.0
 47.40  0.13  full_init 3 (/100)

@xavierleroy
Copy link
Contributor Author

For the same reasons explained in #10701 (comment), the internal state of L64X128 (four 64-bit integers) is represented as a 1-dimensional big array of int64. This ensures proper alignment of the state on 32-bit platforms and gives us marshaling and other generic operations for free. If this choice is a problem with js-of-ocaml, an opaque representation (fully implemented in C) could be used instead.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks promising, thanks!

I checked that the code is as in the paper.

runtime/prng.c Outdated Show resolved Hide resolved
runtime/prng.c Outdated
struct LXM_state * st = LXM_val(v);
st->a = i1 | 1; /* must be odd */
st->x[0] = i2 != 0 ? i2 : 1; /* must be nonzero */
st->x[1] = i3 != 0 ? i3 : 2; /* must be nonzero */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The paper writes something different about the x parameter initialization.

The constructor then forces the additive parameter to be odd by setting its low-order
bit to 1, but beyond that no additional vetting of the additive parameter (to reject “weak valuesž
[Steele, Lea, and Flood 2014]) is necessary. In the unlikely circumstance that the state for the XBG
subgenerator is entirely 0, it is necessary to force it to be nonzero; this could be done by making
additional calls to nextLong() or nextInt(), but it is easier (and acceptable in practice, because it
happens so rarely) to derive the new XBG state from the new LCG state.

I'm not sure what they mean by "derive the new XBG state from the new LCG state", but maybe they are thinking of setting one of x[0] or x[1] to the odd st->a value? (Note: the code currently suggests that neither of x0 and x1 can be zero, but in fact the paper says that having only one of them non-zero is enough.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of "x[0] and x[1] must not be both 0", I used the more conservative "none of x[0] and x[1] should be zero" because that's what the Xoshiro papers recommend, if I remember correctly.

For states that are created by splitting, i2 and i3 were obtained by pseudo-random number generation, so the probability that they are 0 is extremely low, and using 1 and 2 in this unlikely case is unnoticeable.

What they mean by "derive the new XBG state from the new LCG state" is better explained in the Xoshiro papers. To initialize an XBG from a single 64-bit integer seed, Vigna suggests to use... a SplitMix PRNG, initialized with this seed, and run N times to get N good-looking numbers to initialize the N-word state of the XBG.

For states that are created by initialization from an array of integers (function Random.full_init and related functions), the situation is more delicate. Sometimes we are given a single integer (function Random.init), in which case Vigna's suggestion would apply. But sometimes we are given 12 bytes with good entropy (function Random.self_init when /dev/urandom can be read) and we should "spread out" this 96-bit entropy on the 4 components of the L64X128 state.

After some thinking, I implemented a cute solution where the integer array is hashed (as a string of 64-bit "characters") using a hash function that produces a 256-bit result, which is then taken as the initial L64X128 state. It is hoped that the result incorporates all the entropy present in the integer array, up to 256 bits, yet produces four 64-bit numbers that don't have strong correlations and "look random already" (whatever that means).

This solution is commit 83c2830 in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the solution and I think it should work fine indeed. Two remarks (that don't suggest any change):

  • We would also have the choice of asking /dev/urandom for more input (but that doesn't work when /dev/urandom is not available, where we generate a fixed amount of bytes with bad entropy)
  • We already have a hash function implemented in the runtime, Murmurhash 3 (the 32bit-producing version). I understand that you need a 256-large output, so FNV1a fits the bill better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hash functions come in all sizes and shapes :-) However, I woke up in the middle of the night with the realization that FNV1a doesn't avalanche much on short inputs (e.g. a single integer).

So, I just pushed yet another proposal that uses good old MD5 as the hashing function. It only produces 128 bits of output, so I call it twice with some diversification of the input. Now, it should avalanche perfectly, while preserving up to 128 bits of entropy present in the seed. It's a bit slower than the previous implementation, but still significantly faster than the one in OCaml today.

As an additional benefit, all the initialization code is now done in OCaml, reducing the C code in runtime/prng.c to the one speed-critical function next.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further remark 1: yes, we need a 256-bit (or more) crypto hash function to replace MD5 in Digest. It's on my to do list.

Further remark 2: yes, we should revisit the PRNG initialization from /dev/urandom. This will come next.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding next, I found some time again to try a pure OCaml implementation (using bigarrays this time), and in optimal conditions (amd64 backend, native mode) it outperforms the C version by ~5% in microbenchmarks.
It is less readable than the C version though, and of course it will probably be significantly slower in bytecode and 32-bit native backends.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not surprised you're getting good performance from a pure OCaml implementation on x86-64: there is not much to optimize in this code, and the only optimization OCaml misses is the recognition of the "rotate" instruction. Note however that I forgot to declare the C primitive as "noalloc". This would regain a few percents...

stdlib/random.ml Outdated Show resolved Hide resolved
The implementation in stdlib/random.ml is going to change.
@xavierleroy
Copy link
Contributor Author

Now that #10887 is merged, it's time to change PRNG ! Please review...

…M family

As described in the paper [LXM: Better Splittable Pseudorandom Number
Generators (and Almost as Fast)](https://doi.org/10.1145/3485525) by
Guy L. Steele Jr. and Sebastiano Vigna, proceedings of OOPSLA 2021.

This provides a fast, state-of-the-art PRNG with full support for splitting.

Initialization from a seed (an array of integers) proceeds by
serializing the array to a byte sequence (each seed element occupies 8
bytes in little-endian representation) then hashing twice using the
MD5 hash function and two different suffixes, obtaining 256 bits of
initialization data.

More tests were added in testsuite/tests/lib-random, and the lone test
testsuite/tests/basic-more/testrandom.ml was moved there.

let _ =
let a = Array1.of_array Int64 C_layout [| 1L; 2L; 3L; 4L |] in
(* Violate abstraction of type Random.State.t to manipulate state directly *)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered if we should provide a function to set the "raw" state in the library.

Pros:

  • no Obj.magic here
  • there might be cases where users want to avoid paying the cost of the additional mixing operations

Cons:

  • this breaks the generator if an all-zero array is passed
  • we have to leak details about the generator algorithm to explain this (which raw seeds to avoid)

In the end I think the current approach is sensible, and no change is required.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, thanks!

I checked that the implementation is as described in the paper.

It's nice that you have test vectors from a reference implementation.
Would you by chance have a pointer to this implementation? I couldn't find a pointer from the paper itself. I wanted to see if they used the same "bad state protection" tricks that you use in set (this is probably not checked by the test vectors you chose, as the trick are put to use with probability almost-0).

My review comments are mostly comments. There is a question about the usage of max_int (and its portability), and possibly a suggestion to simplify full_int.

z = (z ^ (z >> 32)) * 0xdaba0b6eb09322e3;
z = (z ^ (z >> 32));
/* LCG update */
st->s = st->s * M + st->a;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the paper uses M * s + a, not sure why there is a slight difference here.

With multiple domains, each domain has its own generator that evolves
independently of the generators of other domains. When a domain is
created, its generator is initialized by splitting the state
of the generator associated with the parent domain.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that we should have fixed this comment in the previous PR already (#10887). Thanks for noticing.

#include "caml/bigarray.h"
#include "caml/mlvalues.h"

/* The L64X128 member of the LXM family. Taken from figure 1 in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the mixing function (lea64) should be mentioned as well. (The LXM authors describe "better mixing functions" as interesting future work, so it may be that in ten years what people understand as "The L64X128 member" is not that one anymore.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference to figure 1 in the paper makes it clear which variant is being implemented.

in
(bpos, max_int)
in
let r = Int64.to_int (next s) land max_int in
let v = r mod n in
if r - v > max_int - n + 1 then int63aux s n else v
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nice that we can simplify these functions so much with a full int64 generator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But: I'm confused by the use of max_int here (already present before this change), what happens when running OCaml 32bit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This slightly confusing code predates this PR. The idea is that full_int calls int63aux only when bound > 0x3FFFFFFF, which cannot happen on a 32 bit platform.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could add an assert (Sys.int_size >= 63) here to make this precondition explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because there's also the js-of-ocaml case (Sys.int_size = 32). This was discussed at length in #9489; I won't have this discussion again.

@@ -126,13 +102,15 @@ module State = struct
else
intaux s bound
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could simplify the code here by calling int63aux unconditionally, right?

(This would also reduce the average number of retries, in a noticeable way for bounds slightly above 2^29.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We wanted full_int to return the same values as int when the bound is small enough.

let v = Int32.rem r n in
if Int32.sub r v > Int32.add (Int32.sub Int32.max_int n) 1l
if Int32.(sub r v > add (sub max_int n) 1l)
then int32aux s n
else v
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, drawing in the full int64 range would also reduce the number of retries in some cases, but it would perform worse in 32bit mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.


(* Return 30 random bits as an integer 0 <= x < 1073741824 *)
let bits s =
Int64.to_int (next s) land 0x3FFF_FFFF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could call this bits30 for consistency, as in your Pringo API, and then have bits as a backward-compatibility alias.

let rec rawfloat s =
let b = next s in
let n = Int64.shift_right_logical b 11 in
if n <> 0L then Int64.to_float n *. 0x1.p-53 else rawfloat s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember we had a discussion about how to draw floats recently, probably with @silene, and there was an explanation for why one approach was more desirable than the other. Could the reasons for using this one be mentioned in a comment?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I was invoked in this thread, I will take the chance to point out that the comment is a truism. I guess it was meant to read "Return a float 0 < x < 1 that is a multiple of 2^-53".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussion was in your RFC, ocaml/RFCs#28 (comment)

The bottom line for the discussion is that the current approach (multiples of 2^-53) is very simple and pretty good overall. More sophisticated algorithms can produce any representable FP number between 0.0 and 1.0 with the appropriate probability, but they are more complex and I'm not convinced there is a benefit.

Agreed about the tautological nature of the comment. (But at least it's not wrong!) Will reword.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, since the only user of rawfloat is float, there is no point in excluding zero and thus to write a recursive function. Or is this an undocumented feature of float that, for normal values of bound, it returns a non-zero value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was also discussed in the RFC: ocaml/RFCs#28 (comment) Eventually, I'd like to export rawfloat (it's called uniform in PRINGO) with explicit guarantees that it returns a random FP between 0 and 1 excluded.

Array1.unsafe_set s 0 (Int64.logor i1 1L); (* must be odd *)
Array1.unsafe_set s 1 i2;
Array1.unsafe_set s 2 (if i3 <> 0L then i3 else 1L); (* must not be 0 *)
Array1.unsafe_set s 3 (if i4 <> 0L then i4 else 2L) (* must not be 0 *)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why those unsafe_ are correct is thanks to type abstraction: the module exports an abstract state type t, and the three exported functions that can build a t (make, make_self_init, copy) guarantee that the size is exactly 4.

stdlib/random.ml Outdated
Bytes.set_int64_le b (i * 8) (Int64.of_int seed.(i))
done;
Bytes.set b (n * 8) '\x01';
let d1 = Bytes.unsafe_of_string (Digest.bytes b) in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to use Bytes.unsafe_of_string if we use String.get_int64_le below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point, thanks! I was not aware that get_* functions had been added to String...

@hhugo
Copy link
Contributor

hhugo commented Jan 16, 2022

If this choice is a problem with js-of-ocaml, an opaque representation (fully implemented in C) could be used instead.

The current implementation should work fine with jsoo once next is implemented.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that the code is correct, well-tested, and that the feature is useful.

(It's nice in particular that, on 64-bit systems, we are able to get a better generator, with extra features, and a performance improvement.)

xavierleroy added a commit to xavierleroy/ocaml that referenced this pull request Jan 16, 2022
@xavierleroy
Copy link
Contributor Author

t's nice that you have test vectors from a reference implementation. Would you by chance have a pointer to this implementation?

It's a cut-and-paste from figure 1 in the paper, plus a main function.

@gasche
Copy link
Member

gasche commented Jan 16, 2022

Ah, so we don't know what choice they made to sanitize the state-initialization input. No big deal.

(check-typo complains that make alldepend needs to re-run now that you depend on String as well.)

@gasche
Copy link
Member

gasche commented Jan 17, 2022

I went ahead and merged -- I consider that people had ample time to voice any concern since ocaml/RFCs#28 was opened in early September.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants