Implement recovery of HTCondor spool (job queue) #2500
Merged
Implement high availability of the HTCondor job queue by using a managed instance group to provision a single VM with "stateful":
In a single zone, this configuration protects against a failure of the access point VM: when the MIG replaces the VM, the stateful disk is mounted by the replacement. In two zones, this is a proper "HA" configuration, meaning it also protects against a failure of the access point's zone. When configured to use two zones, the disk is replicated synchronously.
In the 2nd commit, we promote alignment of zones between the HTCondor pool components by using a different MIG target shape. In manual testing, I observe higher alignment. Future work will identify a more robust solution (likely by having htcondor-setup output 2 zones).
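For readers unfamiliar with stateful MIGs, the shape of the configuration described above can be sketched as the request body for a regional managed instance group. This is an illustrative sketch only, not the code in this PR (which is expressed in the module's own configuration). The function name, the argument names, the `zones/<name>` partial-URL form, the `"NEVER"` auto-delete rule, and the example values are all assumptions. The field names follow the Compute Engine `regionInstanceGroupManagers` resource as I understand it, and the PR does not state which target shape it selects, so that is left as a parameter.

```python
from typing import Any, Dict, List

# Target shapes accepted by regional MIGs (Compute Engine API enum values).
VALID_TARGET_SHAPES = {"EVEN", "BALANCED", "ANY", "ANY_SINGLE_ZONE"}


def build_access_point_mig_body(
    name: str,
    instance_template: str,
    zones: List[str],
    spool_device_name: str,
    target_shape: str,
) -> Dict[str, Any]:
    """Build a request body for a single-VM regional MIG with a stateful disk.

    Hypothetical helper for illustration; it only assembles a dict and makes
    no API calls.
    """
    if not 1 <= len(zones) <= 2:
        raise ValueError("expected one zone (VM recovery) or two zones (HA)")
    if target_shape not in VALID_TARGET_SHAPES:
        raise ValueError(f"unknown target shape: {target_shape}")

    return {
        "name": name,
        "baseInstanceName": name,
        "instanceTemplate": instance_template,
        # A single access point VM; the MIG recreates it on failure.
        "targetSize": 1,
        "distributionPolicy": {
            # Assumption: zones given as partial URLs.
            "zones": [{"zone": f"zones/{z}"} for z in zones],
            "targetShape": target_shape,
        },
        "statefulPolicy": {
            "preservedState": {
                # The disk holding the spool survives VM replacement and is
                # reattached to the new VM. "NEVER" is an assumed delete rule.
                "disks": {spool_device_name: {"autoDelete": "NEVER"}}
            }
        },
        # Assumption: proactive redistribution is disabled, which I believe
        # regional MIGs with stateful configuration require.
        "updatePolicy": {"instanceRedistributionType": "NONE"},
    }


if __name__ == "__main__":
    body = build_access_point_mig_body(
        name="htcondor-ap",                      # hypothetical name
        instance_template="global/instanceTemplates/htcondor-ap",  # hypothetical
        zones=["us-central1-a", "us-central1-c"],
        spool_device_name="htcondor-spool-disk",  # hypothetical device name
        target_shape="ANY_SINGLE_ZONE",           # example value only
    )
    print(body["distributionPolicy"])
```

The synchronous replication mentioned above is a property of the disk itself (a regional disk with replica zones), not of the MIG, so it does not appear in this body.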
Submission Checklist
Please take the following actions before submitting this pull request.