Merge pull request #22122 from rkern/doc/sequence-of-seeds
DOC: Explain how to use sequences of integers as seeds.
charris committed Aug 18, 2022
2 parents e8c4559 + 3e0bd19 commit 75ae7be
Showing 2 changed files with 84 additions and 2 deletions.
3 changes: 2 additions & 1 deletion doc/source/reference/random/index.rst
@@ -216,9 +216,10 @@ Parallel Generation
~~~~~~~~~~~~~~~~~~~

The included generators can be used in parallel, distributed applications in
-one of three ways:
+a number of ways:

* :ref:`seedsequence-spawn`
* :ref:`sequence-of-seeds`
* :ref:`independent-streams`
* :ref:`parallel-jumped`

83 changes: 82 additions & 1 deletion doc/source/reference/random/parallel.rst
@@ -1,7 +1,7 @@
Parallel Random Number Generation
=================================

-There are three strategies implemented that can be used to produce
+There are four main strategies implemented that can be used to produce
repeatable pseudo-random numbers across multiple processes (local
or distributed).

@@ -109,6 +109,87 @@ territory ([2]_).
.. _`not unique to numpy`: https://www.iro.umontreal.ca/~lecuyer/myftp/papers/parallel-rng-imacs.pdf


.. _sequence-of-seeds:

Sequence of Integer Seeds
-------------------------

As discussed in the previous section, `~SeedSequence` can take not only an
integer seed but also an arbitrary-length sequence of (non-negative)
integers. With a little care, one can use this feature to design *ad hoc*
schemes for getting parallel PRNG streams with safety guarantees similar to
those of spawning.

For example, a common use case is that each worker process is passed one
root seed integer for the whole calculation along with an integer worker ID
(or something more granular, such as a job ID or batch ID). If these IDs are
created deterministically and uniquely, then one can derive reproducible
parallel PRNG streams by combining the ID and the root seed integer in a
list.

.. code-block:: python

   # default_rng() and each of the BitGenerators use SeedSequence underneath, so
   # they all accept sequences of integers as seeds the same way.
   from numpy.random import default_rng

   def worker(root_seed, worker_id):
       rng = default_rng([worker_id, root_seed])
       # Do work ...
       return rng.random()  # placeholder result so that the example runs

   root_seed = 0x8c3c010cb4754c905776bdac5ee7501
   results = [worker(root_seed, worker_id) for worker_id in range(10)]

.. end_block

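
As a quick check of the scheme described above, the following minimal sketch
seeds a few workers twice and compares their first draws. The helper name
``first_draws`` and the root seed value are assumptions chosen only for
illustration; they are not part of NumPy.

.. code-block:: python

   from numpy.random import default_rng

   def first_draws(root_seed, worker_id, n=3):
       # Hypothetical helper: seed a Generator from [worker_id, root_seed]
       # and return its first n integer draws as a plain list.
       rng = default_rng([worker_id, root_seed])
       return rng.integers(0, 2**32, size=n).tolist()

   root_seed = 12345  # arbitrary illustrative value
   run_a = [first_draws(root_seed, worker_id) for worker_id in range(4)]
   run_b = [first_draws(root_seed, worker_id) for worker_id in range(4)]

   # The same (worker_id, root_seed) pair reproduces the same stream ...
   assert run_a == run_b
   # ... while different worker IDs give different streams.
   assert len({tuple(draws) for draws in run_a}) == 4

.. end_block
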
This can replace a number of unsafe strategies used in the past that try to
combine the root seed and the ID back into a single integer seed value. For
example, it is common to see users add the worker ID to the root seed,
especially with the legacy `~RandomState` code.

.. code-block:: python

   import numpy as np

   # UNSAFE! Do not do this!
   worker_seed = root_seed + worker_id
   rng = np.random.RandomState(worker_seed)

.. end_block

It is true that for any one run of a parallel program constructed this way,
each worker will have a distinct stream. However, it is quite likely that
multiple invocations of the program with different seeds will get overlapping
sets of worker seeds. It is not uncommon (in the author's own experience) to
change the root seed merely by an increment or two when doing these repeat
runs. If the worker seeds are also derived by small increments of the worker
ID, then subsets of the workers will return identical results, causing a bias
in the overall ensemble of results.
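
The overlap can be seen with plain integer arithmetic. The sketch below
assumes ten workers and two runs whose root seeds differ by one; the helper
name ``additive_seeds`` and the seed values are illustrative only.

.. code-block:: python

   n_workers = 10

   def additive_seeds(root_seed):
       # The unsafe scheme: each worker seed is root_seed + worker_id.
       return {root_seed + worker_id for worker_id in range(n_workers)}

   run_1 = additive_seeds(1000)  # worker seeds 1000 .. 1009
   run_2 = additive_seeds(1001)  # worker seeds 1001 .. 1010

   # Nine of the ten worker seeds are shared between the two runs, so nine
   # workers in the second run would repeat results from the first.
   shared = run_1 & run_2
   assert len(shared) == 9

.. end_block
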

Combining the worker ID and the root seed as a list of integers eliminates this
risk. Even lazy seeding practices, such as incrementing the root seed by one
between runs, will still be fairly safe.
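
As a rough illustration, the sketch below derives the initial state words for
ten workers in two runs whose root seeds differ by one, and checks that no
state is shared between the runs. The helper name ``worker_states`` and the
seed values are assumptions made for this example.

.. code-block:: python

   from numpy.random import SeedSequence

   def worker_states(root_seed, n_workers=10):
       # Hypothetical helper: collect the first four 32-bit state words that
       # SeedSequence derives from each [worker_id, root_seed] pair.
       return {
           tuple(SeedSequence([worker_id, root_seed]).generate_state(4).tolist())
           for worker_id in range(n_workers)
       }

   run_1 = worker_states(1000)
   run_2 = worker_states(1001)

   # Every worker gets its own state within a run, and adjacent root seeds
   # do not produce shared states across runs.
   assert len(run_1) == 10 and len(run_2) == 10
   assert not (run_1 & run_2)

.. end_block
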

This scheme does require that the extra IDs be unique and deterministically
created. This may require coordination between the worker processes. It is
recommended to place the varying IDs *before* the unvarying root seed.
`~SeedSequence.spawn` *appends* integers after the user-provided seed, so if
you might be mixing both this *ad hoc* mechanism and spawning, or passing your
objects down to library code that might be spawning, then it is a little bit
safer to prepend your worker IDs rather than append them to avoid a collision.

.. code-block:: python

   # Good.
   worker_seed = [worker_id, root_seed]

   # Less good. It will *work*, but it's less flexible.
   worker_seed = [root_seed, worker_id]

.. end_block

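
The way spawning extends a seed can be inspected through the ``entropy`` and
``spawn_key`` attributes of `~SeedSequence`. The following is a small sketch;
the root seed and the worker ID ``7`` are arbitrary values chosen for
illustration.

.. code-block:: python

   from numpy.random import SeedSequence

   root_seed = 12345  # arbitrary illustrative value

   # Spawned children keep the parent's entropy and are distinguished by
   # integers appended to the spawn key.
   children = SeedSequence(root_seed).spawn(2)
   assert [child.entropy for child in children] == [root_seed, root_seed]
   assert [child.spawn_key for child in children] == [(0,), (1,)]

   # An ad hoc seed with the worker ID prepended has an empty spawn key.
   # Code that later spawns from it appends to that key as usual.
   manual = SeedSequence([7, root_seed])
   assert manual.spawn_key == ()
   grandchild = manual.spawn(1)[0]
   assert grandchild.spawn_key == (0,)

.. end_block
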
With those caveats in mind, the safety guarantees against collision are about
the same as with spawning, discussed in the previous section. The algorithmic
mechanisms are the same.


.. _independent-streams:

Independent Streams