Use Atlas test cluster to synchronize operations between astrolabe and workload executors #79

Open

prashantmital opened this issue Jun 30, 2020

A configurable setting should be added to astrolabe that specifies a special namespace (e.g. sentinel_database.sentinel_collection) on the Atlas test cluster, which astrolabe and the workload executors will then use to synchronize their operations.
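As a rough illustration, the setting could be surfaced through the environment and split into its database and collection parts. The variable name below is hypothetical, not an existing astrolabe option:

```python
# Hypothetical sketch: ASTROLABE_SENTINEL_NAMESPACE is not an existing
# astrolabe setting; it only illustrates how the namespace could be configured.
import os

namespace = os.environ.get(
    "ASTROLABE_SENTINEL_NAMESPACE", "sentinel_database.sentinel_collection")
sentinel_db, sentinel_coll = namespace.split(".", 1)
```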

The flow might look something like this:

  • After astrolabe starts the workload executor, it writes the following record (with writeConcern: majority) to sentinel_database.sentinel_collection:
{ '_id': <run_id>, 'status': 'inProgress' }

Here, run_id is some identifier that is known to both astrolabe and the workload executor.

  • After each iteration of running all operations in the operations array (see https://mongodb-labs.github.io/drivers-atlas-testing/spec-test-format.html), the workload executor checks the sentinel_database.sentinel_collection collection (with readConcern: majority) for the record bearing _id: <run_id>. On seeing that the status is still inProgress, the workload executor proceeds to the next iteration of running operations.

  • Once the maintenance has completed and astrolabe wants to tell the workload executor to quit, it updates the sentinel record (using writeConcern: majority) to:

{ '_id': <run_id>, 'status': 'done' }
  • On the next check, the workload executor sees that the status is now done, and it updates this record with execution statistics (using writeConcern: majority):
{ '_id': <run_id>, 'status': 'done', 'executionStats': {<field1>: <value1>, ...} }

After this, the workload executor exits.

  • Astrolabe waits for the workload executor process ($PID) to exit. Once it has exited, astrolabe reads the execution statistics written by the workload executor. A sketch of the full handshake follows this list.
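As a minimal sketch (not part of astrolabe today), the handshake could be implemented with pymongo roughly as follows. The helper names (sentinel_collection, start_run, signal_done, run_workload, collect_stats) are illustrative assumptions, and the executionStats fields are left to whatever the executor actually collects:

```python
# Sketch of the proposed sentinel-document protocol using pymongo.
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

SENTINEL_DB, SENTINEL_COLL = "sentinel_database", "sentinel_collection"

def sentinel_collection(client, *, for_writes):
    coll = client[SENTINEL_DB][SENTINEL_COLL]
    if for_writes:
        return coll.with_options(write_concern=WriteConcern(w="majority"))
    return coll.with_options(read_concern=ReadConcern("majority"))

# --- astrolabe side --------------------------------------------------------
def start_run(client, run_id):
    # Written right after the workload executor process has been started.
    sentinel_collection(client, for_writes=True).insert_one(
        {"_id": run_id, "status": "inProgress"})

def signal_done(client, run_id):
    # Written once the Atlas maintenance has completed.
    sentinel_collection(client, for_writes=True).update_one(
        {"_id": run_id}, {"$set": {"status": "done"}})

# --- workload executor side -------------------------------------------------
def run_workload(client, run_id, run_operations_once, collect_stats):
    while True:
        run_operations_once()  # one pass over the operations array
        doc = sentinel_collection(client, for_writes=False).find_one(
            {"_id": run_id})
        if doc and doc["status"] == "done":
            break
    # Record execution statistics on the sentinel document before exiting;
    # the exact fields are whatever the executor collects.
    sentinel_collection(client, for_writes=True).update_one(
        {"_id": run_id}, {"$set": {"executionStats": collect_stats()}})
```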

Advantages of this approach

  1. No more signal handling - this has been a thorn in the implementation, and we are only up to 2 languages at this point. Workload executors are already equipped to talk to the Atlas deployment, so we know that the approach proposed here will be painless to implement.
  2. Workload executors can 'run anywhere' - since we no longer rely on platform-specific signals, we can coordinate between astrolabe and a workload executor no matter where they are running. This will be especially helpful in the context of running the workload executors inside containers where signals are not a viable option for process synchronization.
  3. Enable support for more complex communication - this design makes it possible to support richer interactions between the workload executor and astrolabe in the future.
  4. No more sentinel files - we no longer rely on files written by the workload executor to communicate execution stats.
  5. Use what you build - this one is pretty obvious (databases exist to store state and communicate it between processes that might get partitioned).

Edge cases

  1. Workload executor is partitioned from the Atlas test cluster: this will make the W-E unable to read the sentinel document (caused, e.g., by a bug in the driver being tested or by the Atlas test cluster going offline). This can be handled by using an appropriate timeout on the wait astrolabe performs on the workload executor's $PID (see the sketch after this list). If the workload executor does not stop running within the timeout, an error will be reported.
  2. Astrolabe is partitioned from the Atlas test cluster: this is possible even in the current design. If astrolabe cannot write the sentinel document at the start of a run, we can mark the run as a system failure. If astrolabe cannot update the record when it needs to signal the W-E to stop, or cannot read the execution stats, we can mark this as a test failure, since the maintenance or the workload possibly broke something.
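For edge case 1, a hedged sketch of the timeout, assuming astrolabe starts the workload executor with subprocess.Popen; the 300-second value is illustrative only:

```python
import subprocess

def wait_for_executor(process: subprocess.Popen, timeout_seconds: float = 300):
    try:
        process.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # The executor never observed status: 'done' (e.g. it is partitioned
        # from the Atlas test cluster), so report an error instead of hanging.
        process.kill()
        raise RuntimeError("workload executor did not exit within the timeout")
```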

CC: @mbroadst @vincentkam
