Implement recovery of HTCondor spool (job queue) #2500

tpdownes · 2024-04-19T19:16:27Z

Implement high availability of the HTCondor job queue by using a managed instance group to provision a single VM with "stateful":

- public or private IP address
- disk containing the spool (job queue and related data)

In a single zone, this configuration provides reliability against a failure of the access point VM: when the MIG replaces the VM, the stateful disk is mounted by the new VM. In two zones, this is a proper "HA" configuration, meaning reliability against a failure of the access point's zone. When configured to use two zones, the disk is replicated synchronously.

In the 2nd commit, we promote alignment of zones between the HTCondor pool components by using a different MIG target shape. In practice, I observe higher alignment in manual testing. Future work will identify a more robust solution (likely by having htcondor-setup output 2 zones).
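For readers unfamiliar with stateful MIGs, the configuration described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the resource name, region, zones, device name, interface name, and the `access_point_template` variable are all hypothetical placeholders, and the `update_policy` values are one plausible choice rather than the settings used in this PR.

```hcl
# Hypothetical input: self link of an instance template defined elsewhere.
variable "access_point_template" {
  description = "Self link of the access point instance template (placeholder)"
  type        = string
}

resource "google_compute_region_instance_group_manager" "access_point" {
  name               = "htcondor-ap-mig" # placeholder name
  base_instance_name = "htcondor-ap"     # placeholder name
  region             = "us-central1"     # placeholder region
  target_size        = 1                 # a single access point VM

  # One zone gives recovery from VM failure; two zones allow recovery
  # from a zone failure (placeholder zones).
  distribution_policy_zones = ["us-central1-a", "us-central1-b"]

  # Target shape discussed in the 2nd commit: place all VMs in one zone,
  # chosen by the MIG.
  distribution_policy_target_shape = "ANY_SINGLE_ZONE"

  version {
    instance_template = var.access_point_template
  }

  # Keep the spool disk when the MIG recreates the VM.
  stateful_disk {
    device_name = "htcondor-spool" # placeholder device name
    delete_rule = "NEVER"
  }

  # Keep the internal IP address when the MIG recreates the VM.
  stateful_internal_ip {
    interface_name = "nic0"
    delete_rule    = "NEVER"
  }

  # Assumed settings: proactive redistribution is disabled so the MIG does
  # not move the stateful VM between zones on its own.
  update_policy {
    type                         = "OPPORTUNISTIC"
    minimal_action               = "REPLACE"
    replacement_method           = "RECREATE"
    instance_redistribution_type = "NONE"
    max_surge_fixed              = 0
    max_unavailable_fixed        = 2
  }
}
```

The sketch does not show the disk itself. For the two-zone case, the synchronous replication mentioned above would additionally require a regional disk replicated across both zones, which is omitted here.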

Submission Checklist

Please take the following actions before submitting this pull request.

Fork your PR branch from the Toolkit "develop" branch (not main)
Test all changes with pre-commit in a local branch
Confirm that "make tests" passes all tests
Add or modify unit tests to cover code changes
Ensure that unit test coverage remains above 80%
Update all applicable documentation
Follow Cloud HPC Toolkit Contribution guidelines

community/modules/scheduler/htcondor-access-point/README.md

Implement high availability of the HTCondor job queue by using a managed instance group to provision a single VM with "stateful":

- public or private IP address
- disk containing the spool (job queue and related data)

In a single zone, "HA" means reliability against a failure of the access point VM. In two zones, "HA" means reliability against a failure of the access point's zone. When configured to use two zones, the disk is replicated synchronously.

Future work: when using 2 zones, identify a solution for aligning zones between HTCondor pool components (Central Manager, Execute Points)

…mponents

Use ANY_SINGLE_ZONE to increase the odds that HTCondor core components are provisioned in the same zone by the regional Managed Instance Group.

Reasoning:

1. This target shape prioritizes the zone with the most reservations (therefore reservations can guide alignment)
2. The next priority is resource availability within the zone, which should also serve to align resources.
3. Align machine_type for Central Manager and Access Point

Replace a fixed-wait sleep with the built-in feature that Terraform offers to wait for instance creation in MIGs.
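As a rough illustration of the built-in wait, the Google provider's instance group manager resources accept `wait_for_instances` and `wait_for_instances_status`. The sketch below assumes these are the arguments the commit relies on; the resource name, region, and the `template_self_link` variable are hypothetical placeholders, not taken from the module.

```hcl
# Hypothetical input: self link of an instance template defined elsewhere.
variable "template_self_link" {
  description = "Self link of the instance template (placeholder)"
  type        = string
}

resource "google_compute_region_instance_group_manager" "example" {
  name               = "htcondor-example-mig" # placeholder name
  base_instance_name = "htcondor-example"     # placeholder name
  region             = "us-central1"          # placeholder region
  target_size        = 1

  version {
    instance_template = var.template_self_link
  }

  # Block "terraform apply" until the MIG's instances have been created,
  # instead of sleeping for a fixed interval afterwards.
  wait_for_instances        = true
  wait_for_instances_status = "STABLE"
}
```

With this approach, resources that depend on the MIG are created only after Terraform observes the group reaching the requested status, so the wait adapts to how long provisioning actually takes.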

tpdownes added the do-not-merge label ("Block merging of this PR") Apr 19, 2024

tpdownes self-assigned this May 6, 2024

tpdownes force-pushed the htcondor_ap_stateful branch from 818119c to 1c19b66 Compare May 7, 2024 23:36

tpdownes requested a review from rohitramu May 8, 2024 02:27

tpdownes assigned rohitramu and unassigned tpdownes May 8, 2024

tpdownes added the release-module-improvements label ("Added to release notes under the "Module Improvements" heading.") and removed the do-not-merge label ("Block merging of this PR") May 8, 2024

tpdownes marked this pull request as ready for review May 8, 2024 02:30

rohitramu approved these changes May 9, 2024

community/modules/scheduler/htcondor-access-point/README.md (review comment: outdated, resolved)

rohitramu assigned tpdownes and unassigned rohitramu May 9, 2024

tpdownes force-pushed the htcondor_ap_stateful branch 3 times, most recently from df5ca11 to 516f181 Compare May 10, 2024 20:37

tpdownes added 4 commits May 11, 2024 15:01

Update HTCondor module usage of CFT to be fully TPG 5.x compatible

ebadb02

HTCondor improve reliability on first provision

1eeab10

Replace a fixed-wait sleep with the built-in feature that Terraform offers to wait for instance creation in MIGs.

tpdownes force-pushed the htcondor_ap_stateful branch from 9266a58 to 1eeab10 Compare May 11, 2024 20:01

tpdownes merged commit 8920ee0 into GoogleCloudPlatform:develop May 11, 2024
9 of 47 checks passed

tpdownes deleted the htcondor_ap_stateful branch May 11, 2024 21:18
