Use Atlas test cluster to synchronize operations between astrolabe and workload executors #79

Open

prashantmital opened this issue Jun 30, 2020

A configurable setting should be added to astrolabe that specifies a special namespace (e.g. sentinel_database.sentinel_collection) on the Atlas test cluster, which astrolabe and the workload executors will then use to synchronize their operations.
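As a rough illustration, the setting could be surfaced through the environment and split into its database and collection parts. The variable name below is hypothetical, not an existing astrolabe option:

```python
# Hypothetical sketch: ASTROLABE_SENTINEL_NAMESPACE is not an existing
# astrolabe setting; it only illustrates how the namespace could be configured.
import os

namespace = os.environ.get(
    "ASTROLABE_SENTINEL_NAMESPACE", "sentinel_database.sentinel_collection")
sentinel_db, sentinel_coll = namespace.split(".", 1)
```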

The flow might look something like this:

  • After astrolabe starts the workload executor, it writes the following record (with writeConcern: majority) to sentinel_database.sentinel_collection:
{ '_id': <run_id>, 'status': 'inProgress' }

Here, run_id is some identifier that is known to both astrolabe and the workload executor.

  • After each iteration of running all operations in the operations array (see https://mongodb-labs.github.io/drivers-atlas-testing/spec-test-format.html), the workload executor checks the sentinel_database.sentinel_collection collection (with readConcern: majority) for the record bearing _id: <run_id>. On seeing that the status is still inProgress, the workload executor proceeds to the next iteration of running operations.

  • Once the maintenance has completed and astrolabe wants to tell the workload executor to quit, it updates the sentinel record (using writeConcern: majority) to:

{ '_id': <run_id>, 'status': 'done' }
  • On the next check, the workload executor sees that the status is now done, and it updates this record with execution statistics (using writeConcern: majority):
{ '_id': <run_id>, 'status': 'done', 'executionStats': {<field1>: <value1>, ...} }

After this, the workload executor exits.

  • Astrolabe waits for the workload executor process ($PID) to exit. Once it has exited, astrolabe reads the execution statistics written by the workload executor. A sketch of the full handshake follows this list.
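As a minimal sketch (not part of astrolabe today), the handshake could be implemented with pymongo roughly as follows. The helper names (sentinel_collection, start_run, signal_done, run_workload, collect_stats) are illustrative assumptions, and the executionStats fields are left to whatever the executor actually collects:

```python
# Sketch of the proposed sentinel-document protocol using pymongo.
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

SENTINEL_DB, SENTINEL_COLL = "sentinel_database", "sentinel_collection"

def sentinel_collection(client, *, for_writes):
    coll = client[SENTINEL_DB][SENTINEL_COLL]
    if for_writes:
        return coll.with_options(write_concern=WriteConcern(w="majority"))
    return coll.with_options(read_concern=ReadConcern("majority"))

# --- astrolabe side --------------------------------------------------------
def start_run(client, run_id):
    # Written right after the workload executor process has been started.
    sentinel_collection(client, for_writes=True).insert_one(
        {"_id": run_id, "status": "inProgress"})

def signal_done(client, run_id):
    # Written once the Atlas maintenance has completed.
    sentinel_collection(client, for_writes=True).update_one(
        {"_id": run_id}, {"$set": {"status": "done"}})

# --- workload executor side -------------------------------------------------
def run_workload(client, run_id, run_operations_once, collect_stats):
    while True:
        run_operations_once()  # one pass over the operations array
        doc = sentinel_collection(client, for_writes=False).find_one(
            {"_id": run_id})
        if doc and doc["status"] == "done":
            break
    # Record execution statistics on the sentinel document before exiting;
    # the exact fields are whatever the executor collects.
    sentinel_collection(client, for_writes=True).update_one(
        {"_id": run_id}, {"$set": {"executionStats": collect_stats()}})
```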

Advantages of this approach

  1. No more signal handling - this has been a thorn in the implementation, and we are only up to 2 languages at this point. Workload executors are already equipped to talk to the Atlas deployment, so we know that the approach proposed here will be painless to implement.
  2. Workload executors can 'run anywhere' - since we no longer rely on platform-specific signals, we can coordinate between astrolabe and a workload executor no matter where they are running. This will be especially helpful in the context of running the workload executors inside containers where signals are not a viable option for process synchronization.
  3. Enable support for more complex communication - this design makes it possible to support richer interactions between the workload executor and astrolabe in the future.
  4. No more sentinel files - we no longer rely on files written by the workload executor to communicate execution stats.
  5. Use what you build - this one is pretty obvious (databases exist to store state and communicate it between processes that might get partitioned).

Edge cases

  1. Workload executor is partitioned from the Atlas test cluster: this will make the W-E unable to read the sentinel document (caused, e.g., by a bug in the driver being tested or by the Atlas test cluster going offline). This can be handled by using an appropriate timeout on the wait astrolabe performs on the workload executor's $PID (see the sketch after this list). If the workload executor does not stop running within the timeout, an error will be reported.
  2. Astrolabe is partitioned from the Atlas test cluster: this is possible even in the current design. If astrolabe cannot write the sentinel document at the start of a run, we can mark the run as a system failure. If astrolabe cannot update the record when it needs to signal the W-E to stop, or cannot read the execution stats, we can mark this as a test failure, since the maintenance or the workload possibly broke something.
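For edge case 1, a hedged sketch of the timeout, assuming astrolabe starts the workload executor with subprocess.Popen; the 300-second value is illustrative only:

```python
import subprocess

def wait_for_executor(process: subprocess.Popen, timeout_seconds: float = 300):
    try:
        process.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # The executor never observed status: 'done' (e.g. it is partitioned
        # from the Atlas test cluster), so report an error instead of hanging.
        process.kill()
        raise RuntimeError("workload executor did not exit within the timeout")
```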

CC: @mbroadst @vincentkam
