Implement recovery of HTCondor spool (job queue) #2500

Merged
merged 4 commits on May 11, 2024

tpdownes (Member) commented Apr 19, 2024

Implement high availability of the HTCondor job queue by using a managed instance group (MIG) to provision a single VM with the following "stateful" resources:

  • public or private IP address
  • disk containing the spool (job queue and related data)

In a single zone, this configuration provides reliability against a failure of the access point VM: when the MIG replaces the VM, the stateful disk is mounted by the new VM. In two zones, this is a proper "HA" configuration, meaning reliability against a failure of the access point's zone. When configured to use two zones, the disk is replicated synchronously.
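
To illustrate the stateful configuration described above, here is a minimal Terraform sketch. It is not the code from this PR. The resource name, the variables, the device name `htcondor-spool`, and the interface name `nic0` are assumptions made for illustration, and the instance template is presumed to attach a disk under that device name. The `update_policy` values are likewise assumed settings for a stateful regional MIG.

```hcl
# Hypothetical inputs; the names are illustrative and not taken from this PR.
variable "region" {
  type = string
}

variable "zones" {
  description = "One zone, or two zones for protection against a zone failure"
  type        = list(string)
}

variable "instance_template_self_link" {
  description = "Template assumed to attach a disk with device name htcondor-spool"
  type        = string
}

resource "google_compute_region_instance_group_manager" "access_point" {
  name               = "htcondor-ap"
  base_instance_name = "htcondor-ap"
  region             = var.region
  target_size        = 1 # a single access point VM

  distribution_policy_zones = var.zones

  version {
    instance_template = var.instance_template_self_link
  }

  # Preserve the spool disk when the MIG replaces the VM, so that the
  # job queue can be mounted by the replacement VM.
  stateful_disk {
    device_name = "htcondor-spool"
    delete_rule = "NEVER"
  }

  # Preserve the private IP address across VM replacement.
  stateful_internal_ip {
    interface_name = "nic0"
    delete_rule    = "NEVER"
  }

  # Assumed settings: stateful MIGs do not support proactive
  # redistribution, and instances are recreated in place rather than
  # substituted so that they keep their names and stateful resources.
  update_policy {
    type                         = "OPPORTUNISTIC"
    minimal_action               = "REPLACE"
    replacement_method           = "RECREATE"
    max_surge_fixed              = 0
    instance_redistribution_type = "NONE"
  }
}
```

A public address could be preserved the same way with a `stateful_external_ip` block in place of, or alongside, `stateful_internal_ip`.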

In the 2nd commit, we promote alignment of zones between the HTCondor pool components by using a different MIG target shape (ANY_SINGLE_ZONE). In manual testing, I observe higher alignment with this shape. Future work will identify a more robust solution (likely by having htcondor-setup output 2 zones).

Submission Checklist

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cloud HPC Toolkit Contribution guidelines

tpdownes added the do-not-merge label (Block merging of this PR) Apr 19, 2024
tpdownes self-assigned this May 6, 2024
tpdownes requested a review from rohitramu May 8, 2024 02:27
tpdownes assigned rohitramu and unassigned tpdownes May 8, 2024
tpdownes added the release-module-improvements label (Added to release notes under the "Module Improvements" heading) and removed the do-not-merge label May 8, 2024
tpdownes marked this pull request as ready for review May 8, 2024 02:30
rohitramu assigned tpdownes and unassigned rohitramu May 9, 2024
tpdownes force-pushed the htcondor_ap_stateful branch 3 times, most recently from df5ca11 to 516f181 May 10, 2024 20:37

Implement high availability of the HTCondor job queue by using a managed
instance group to provision a single VM with the following "stateful"
resources:

- public or private IP address
- disk containing the spool (job queue and related data)

In a single zone, "HA" means reliability against a failure of the access
point VM. In two zones, "HA" means reliability against a failure of the
access point's zone. When configured to use two zones, the disk is
replicated synchronously.

Future work: when using 2 zones, identify solution for aligning zones
between HTCondor pool components (Central Manager, Execute Points)
…mponents

Use ANY_SINGLE_ZONE to increase the odds that HTCondor core components
are provisioned in the same zone by the regional Managed Instance Group.
Reasoning:

1. This target shape prioritizes the zone with the most reservations
   (therefore reservations can guide alignment)
2. The next priority is resource availability within the zone, which
   should also serve to align resources.
3. Align machine_type for Central Manager and Access Point
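
The target shape discussed above can be sketched in Terraform as follows. This is an illustrative fragment, not the code from this commit. The resource name and variables are hypothetical, and setting `instance_redistribution_type` to `NONE` is an assumption about what a non-default target shape requires.

```hcl
# Hypothetical inputs; the names are illustrative and not taken from this PR.
variable "region" {
  type = string
}

variable "zones" {
  description = "Candidate zones from which the MIG picks a single zone"
  type        = list(string)
}

variable "central_manager_template" {
  description = "Self link of the instance template (hypothetical input)"
  type        = string
}

resource "google_compute_region_instance_group_manager" "central_manager" {
  name               = "htcondor-cm"
  base_instance_name = "htcondor-cm"
  region             = var.region
  target_size        = 1

  distribution_policy_zones = var.zones

  # Place all instances in one zone chosen by the MIG, rather than
  # spreading them evenly across the candidate zones.
  distribution_policy_target_shape = "ANY_SINGLE_ZONE"

  version {
    instance_template = var.central_manager_template
  }

  # Assumed to be required when the target shape is not EVEN.
  update_policy {
    type                         = "OPPORTUNISTIC"
    minimal_action               = "REPLACE"
    instance_redistribution_type = "NONE"
  }
}
```

Using the same target shape and candidate zones for each component's MIG raises the chance of a common zone but does not guarantee it, which matches the "increase the odds" wording above.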

Replace a fixed-wait sleep with the built-in feature that Terraform
offers to wait for instance creation in MIGs.
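
The commit message does not name the feature it adopts. One provider argument that fits the description is `wait_for_instances` on the instance group manager resource, sketched below as an assumption. The resource name and variables are hypothetical.

```hcl
# Hypothetical inputs; the names are illustrative and not taken from this PR.
variable "region" {
  type = string
}

variable "instance_template_self_link" {
  type = string
}

resource "google_compute_region_instance_group_manager" "example" {
  name               = "htcondor-example"
  base_instance_name = "htcondor-example"
  region             = var.region
  target_size        = 1

  version {
    instance_template = var.instance_template_self_link
  }

  # Block "terraform apply" until the group's instances have been
  # created and the group is stable, instead of sleeping for a fixed
  # interval after creating the MIG.
  wait_for_instances        = true
  wait_for_instances_status = "STABLE"
}
```

Compared with a fixed sleep, waiting on the group's status returns as soon as the instances exist and reports a failure if they never appear.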
@tpdownes tpdownes merged commit 8920ee0 into GoogleCloudPlatform:develop May 11, 2024
9 of 47 checks passed
@tpdownes tpdownes deleted the htcondor_ap_stateful branch May 11, 2024 21:18