Merge pull request #2500 from tpdownes/htcondor_ap_stateful
Implement recovery of HTCondor spool (job queue)
tpdownes committed May 11, 2024
2 parents 3423904 + 1eeab10 commit 8920ee0
Showing 12 changed files with 197 additions and 116 deletions.
5 changes: 5 additions & 0 deletions community/examples/htc-htcondor.yaml
@@ -20,6 +20,9 @@ vars:
 deployment_name: htcondor-pool
 region: us-central1
 zone: us-central1-c
+zones:
+- $(vars.zone)
+- us-central1-f
 disk_size_gb: 100
 new_image:
 family: htcondor-10x
@@ -129,6 +132,8 @@ deployment_groups:
 default_mig_id: $(htcondor_execute_point.mig_id)
 enable_public_ips: true
 instance_image: $(vars.new_image)
+enable_high_availability: true
+spool_disk_size_gb: 200 # required for regional HA
 outputs:
 - access_point_ips
 - access_point_name
4 changes: 2 additions & 2 deletions community/modules/compute/htcondor-execute-point/README.md
@@ -210,8 +210,8 @@ limitations under the License.

 | Name | Source | Version |
 |------|--------|---------|
-| <a name="module_execute_point_instance_template"></a> [execute\_point\_instance\_template](#module\_execute\_point\_instance\_template) | terraform-google-modules/vm/google//modules/instance_template | ~> 10.1.1 |
-| <a name="module_mig"></a> [mig](#module\_mig) | github.com/terraform-google-modules/terraform-google-vm//modules/mig | aea74d1 |
+| <a name="module_execute_point_instance_template"></a> [execute\_point\_instance\_template](#module\_execute\_point\_instance\_template) | terraform-google-modules/vm/google//modules/instance_template | 10.1.1 |
+| <a name="module_mig"></a> [mig](#module\_mig) | terraform-google-modules/vm/google//modules/mig | 10.1.1 |
 | <a name="module_startup_script"></a> [startup\_script](#module\_startup\_script) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script | v1.32.1&depth=1 |

 ## Resources
6 changes: 4 additions & 2 deletions community/modules/compute/htcondor-execute-point/main.tf
@@ -137,7 +137,7 @@ module "startup_script" {

 module "execute_point_instance_template" {
 source = "terraform-google-modules/vm/google//modules/instance_template"
-version = "~> 10.1.1"
+version = "10.1.1"

 name_prefix = local.name_prefix
 project_id = var.project_id
@@ -163,7 +163,9 @@ module "execute_point_instance_template" {
 }

 module "mig" {
-source = "github.com/terraform-google-modules/terraform-google-vm//modules/mig?ref=aea74d1"
+source = "terraform-google-modules/vm/google//modules/mig"
+version = "10.1.1"
+
 project_id = var.project_id
 region = var.region
 distribution_policy_target_shape = var.distribution_policy_target_shape
23 changes: 14 additions & 9 deletions community/modules/scheduler/htcondor-access-point/README.md
@@ -107,29 +107,31 @@ limitations under the License.
 |------|---------|
 | <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.1 |
 | <a name="requirement_google"></a> [google](#requirement\_google) | >= 3.83 |
-| <a name="requirement_time"></a> [time](#requirement\_time) | ~> 0.9 |
+| <a name="requirement_random"></a> [random](#requirement\_random) | ~> 3.6 |

 ## Providers

 | Name | Version |
 |------|---------|
 | <a name="provider_google"></a> [google](#provider\_google) | >= 3.83 |
-| <a name="provider_time"></a> [time](#provider\_time) | ~> 0.9 |
+| <a name="provider_random"></a> [random](#provider\_random) | ~> 3.6 |

 ## Modules

 | Name | Source | Version |
 |------|--------|---------|
-| <a name="module_access_point_instance_template"></a> [access\_point\_instance\_template](#module\_access\_point\_instance\_template) | github.com/terraform-google-modules/terraform-google-vm//modules/instance_template | 84d7959 |
-| <a name="module_htcondor_ap"></a> [htcondor\_ap](#module\_htcondor\_ap) | github.com/terraform-google-modules/terraform-google-vm//modules/mig | aea74d1 |
+| <a name="module_access_point_instance_template"></a> [access\_point\_instance\_template](#module\_access\_point\_instance\_template) | github.com/terraform-google-modules/terraform-google-vm//modules/instance_template | 73dc845 |
+| <a name="module_htcondor_ap"></a> [htcondor\_ap](#module\_htcondor\_ap) | terraform-google-modules/vm/google//modules/mig | 10.1.1 |
 | <a name="module_startup_script"></a> [startup\_script](#module\_startup\_script) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script | v1.32.1&depth=1 |

 ## Resources

 | Name | Type |
 |------|------|
+| [google_compute_disk.spool](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_disk) | resource |
+| [google_compute_region_disk.spool](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_region_disk) | resource |
 | [google_storage_bucket_object.ap_config](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/storage_bucket_object) | resource |
-| [time_sleep.mig_warmup](https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/sleep) | resource |
+| [random_shuffle.zones](https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/shuffle) | resource |
 | [google_compute_image.htcondor](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image) | data source |
 | [google_compute_instance.ap](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_instance) | data source |
 | [google_compute_region_instance_group.ap](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_region_instance_group) | data source |
@@ -145,16 +147,17 @@ limitations under the License.
 | <a name="input_central_manager_ips"></a> [central\_manager\_ips](#input\_central\_manager\_ips) | List of IP addresses of HTCondor Central Managers | `list(string)` | n/a | yes |
 | <a name="input_default_mig_id"></a> [default\_mig\_id](#input\_default\_mig\_id) | Default MIG ID for HTCondor jobs; if unset, jobs must specify MIG id | `string` | `""` | no |
 | <a name="input_deployment_name"></a> [deployment\_name](#input\_deployment\_name) | HPC Toolkit deployment name. HTCondor cloud resource names will include this value. | `string` | n/a | yes |
-| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | Boot disk size in GB | `number` | `null` | no |
-| <a name="input_distribution_policy_target_shape"></a> [distribution\_policy\_target\_shape](#input\_distribution\_policy\_target\_shape) | Target shape acoss zones for instance group managing high availability of access point | `string` | `"BALANCED"` | no |
+| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | Boot disk size in GB | `number` | `32` | no |
+| <a name="input_disk_type"></a> [disk\_type](#input\_disk\_type) | Boot disk size in GB | `string` | `"pd-balanced"` | no |
+| <a name="input_distribution_policy_target_shape"></a> [distribution\_policy\_target\_shape](#input\_distribution\_policy\_target\_shape) | Target shape acoss zones for instance group managing high availability of access point | `string` | `"ANY_SINGLE_ZONE"` | no |
 | <a name="input_enable_high_availability"></a> [enable\_high\_availability](#input\_enable\_high\_availability) | Provision HTCondor access point in high availability mode | `bool` | `false` | no |
 | <a name="input_enable_oslogin"></a> [enable\_oslogin](#input\_enable\_oslogin) | Enable or Disable OS Login with "ENABLE" or "DISABLE". Set to "INHERIT" to inherit project OS Login setting. | `string` | `"ENABLE"` | no |
 | <a name="input_enable_public_ips"></a> [enable\_public\_ips](#input\_enable\_public\_ips) | Enable Public IPs on the access points | `bool` | `false` | no |
 | <a name="input_enable_shielded_vm"></a> [enable\_shielded\_vm](#input\_enable\_shielded\_vm) | Enable the Shielded VM configuration (var.shielded\_instance\_config). | `bool` | `false` | no |
 | <a name="input_htcondor_bucket_name"></a> [htcondor\_bucket\_name](#input\_htcondor\_bucket\_name) | Name of HTCondor configuration bucket | `string` | n/a | yes |
 | <a name="input_instance_image"></a> [instance\_image](#input\_instance\_image) | Custom VM image with HTCondor and Toolkit support installed."<br><br>Expected Fields:<br>name: The name of the image. Mutually exclusive with family.<br>family: The image family to use. Mutually exclusive with name.<br>project: The project where the image is hosted. | `map(string)` | n/a | yes |
 | <a name="input_labels"></a> [labels](#input\_labels) | Labels to add to resources. List key, value pairs. | `map(string)` | n/a | yes |
-| <a name="input_machine_type"></a> [machine\_type](#input\_machine\_type) | Machine type to use for HTCondor central managers | `string` | `"c2-standard-4"` | no |
+| <a name="input_machine_type"></a> [machine\_type](#input\_machine\_type) | Machine type to use for HTCondor central managers | `string` | `"n2-standard-4"` | no |
 | <a name="input_metadata"></a> [metadata](#input\_metadata) | Metadata to add to HTCondor central managers | `map(string)` | `{}` | no |
 | <a name="input_mig_id"></a> [mig\_id](#input\_mig\_id) | List of Managed Instance Group IDs containing execute points in this pool (supplied by htcondor-execute-point module) | `list(string)` | `[]` | no |
 | <a name="input_network_self_link"></a> [network\_self\_link](#input\_network\_self\_link) | The self link of the network in which the HTCondor central manager will be created. | `string` | `null` | no |
@@ -163,10 +166,12 @@ limitations under the License.
 | <a name="input_region"></a> [region](#input\_region) | Default region for creating resources | `string` | n/a | yes |
 | <a name="input_service_account_scopes"></a> [service\_account\_scopes](#input\_service\_account\_scopes) | Scopes by which to limit service account attached to central manager. | `set(string)` | <pre>[<br> "https://www.googleapis.com/auth/cloud-platform"<br>]</pre> | no |
 | <a name="input_shielded_instance_config"></a> [shielded\_instance\_config](#input\_shielded\_instance\_config) | Shielded VM configuration for the instance (must set var.enabled\_shielded\_vm) | <pre>object({<br> enable_secure_boot = bool<br> enable_vtpm = bool<br> enable_integrity_monitoring = bool<br> })</pre> | <pre>{<br> "enable_integrity_monitoring": true,<br> "enable_secure_boot": true,<br> "enable_vtpm": true<br>}</pre> | no |
+| <a name="input_spool_disk_size_gb"></a> [spool\_disk\_size\_gb](#input\_spool\_disk\_size\_gb) | Boot disk size in GB | `number` | `32` | no |
+| <a name="input_spool_disk_type"></a> [spool\_disk\_type](#input\_spool\_disk\_type) | Boot disk size in GB | `string` | `"pd-ssd"` | no |
 | <a name="input_spool_parent_dir"></a> [spool\_parent\_dir](#input\_spool\_parent\_dir) | HTCondor access point configuration SPOOL will be set to subdirectory named "spool" | `string` | `"/var/lib/condor"` | no |
 | <a name="input_subnetwork_self_link"></a> [subnetwork\_self\_link](#input\_subnetwork\_self\_link) | The self link of the subnetwork in which the HTCondor central manager will be created. | `string` | `null` | no |
 | <a name="input_update_policy"></a> [update\_policy](#input\_update\_policy) | Replacement policy for Access Point Managed Instance Group ("PROACTIVE" to replace immediately or "OPPORTUNISTIC" to replace upon instance power cycle) | `string` | `"OPPORTUNISTIC"` | no |
-| <a name="input_zones"></a> [zones](#input\_zones) | Zone(s) in which access point may be created. If not supplied, will default to all zones in var.region. | `list(string)` | `[]` | no |
+| <a name="input_zones"></a> [zones](#input\_zones) | Zone(s) in which access point may be created. If not supplied, defaults to 2 randomly-selected zones in var.region. | `list(string)` | `[]` | no |

 ## Outputs

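The new `zones` description in the Inputs table ("defaults to 2 randomly-selected zones in var.region") lines up with the `random_shuffle.zones` resource now listed under Resources. The following is a minimal sketch of how such a default could be wired up. It is an illustration only, not the module's actual `main.tf`: the `google_compute_zones.available` data source, the `var.project_id` input, and the `local.zones` name are assumptions that do not appear in the diff shown here.

```hcl
# Hypothetical sketch; not taken from the module source.

# List the zones in the target region (assumed data source and input names).
data "google_compute_zones" "available" {
  project = var.project_id # assumed input
  region  = var.region
}

# Pick 2 zones at random from that list.
resource "random_shuffle" "zones" {
  input        = data.google_compute_zones.available.names
  result_count = 2
}

locals {
  # Prefer user-supplied zones; otherwise fall back to the shuffled pair.
  zones = length(var.zones) > 0 ? var.zones : random_shuffle.zones.result
}
```

With a layout like this, the shuffled result is kept in Terraform state, so the selected zones should stay stable across applies unless the shuffle's inputs change.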
@@ -17,11 +17,10 @@
 hosts: localhost
 become: true
 vars:
-job_queue_ha: false
 spool_dir: /var/lib/condor/spool
 condor_config_root: /etc/condor
 ghpc_config_file: 50-ghpc-managed
-schedd_ha_config_file: 51-ghpc-schedd-high-availability
+htcondor_spool_disk_device: /dev/disk/by-id/google-htcondor-spool-disk
 tasks:
 - name: Ensure necessary variables are set
 ansible.builtin.assert:
@@ -61,71 +60,48 @@
 - name: Configure HTCondor SchedD
 when: htcondor_role == 'get_htcondor_submit'
 block:
-- name: Setup Spool directory
+- name: Format spool disk
+community.general.filesystem:
+fstype: ext4
+state: present
+dev: "{{ htcondor_spool_disk_device }}"
+# RUN TUNE2FS
+- name: Mount spool (creates mount point)
+ansible.posix.mount:
+path: "{{ spool_dir }}"
+src: "{{ htcondor_spool_disk_device }}"
+fstype: ext4
+opts: defaults
+state: mounted
+- name: Ensure spool free space
+ansible.builtin.command: tune2fs -r 0 {{ htcondor_spool_disk_device }}
+- name: Setup spool directory
 ansible.builtin.file:
 path: "{{ spool_dir }}"
 state: directory
 owner: condor
 group: condor
 mode: 0755
-- name: Enable SchedD high availability
-when: job_queue_ha | bool
-block:
-- name: Set SchedD HA configuration (requires restart)
-ansible.builtin.copy:
-dest: "{{ condor_config_root }}/config.d/{{ schedd_ha_config_file }}"
-mode: 0644
-content: |
-MASTER_HA_LIST=SCHEDD
-HA_LOCK_URL=file:{{ spool_dir }}
-VALID_SPOOL_FILES=$(VALID_SPOOL_FILES), SCHEDD.lock
-HA_POLL_PERIOD=30
-SCHEDD_NAME=had-schedd@
-notify:
-- Restart HTCondor
-# although HTCondor is guaranteed to start after mounting remote
-# filesystems is *attempted*, it does not guarantee successful mounts;
-# this additional SystemD setting will refuse to start HTCondor if the
-# spool shared filesystem has not been mounted
-- name: Create SystemD override directory for HTCondor
-ansible.builtin.file:
-path: /etc/systemd/system/condor.service.d
-state: directory
-owner: root
-group: root
-mode: 0755
-- name: Ensure HTCondor starts after shared filesystem is mounted
-ansible.builtin.copy:
-dest: /etc/systemd/system/condor.service.d/mount-spool.conf
-mode: 0644
-content: |
-[Unit]
-RequiresMountsFor={{ spool_dir }}
-notify:
-- Reload SystemD
-- name: Disable SchedD high availability
-when: not job_queue_ha | bool
-block:
-- name: Remove SchedD HA configuration file
-ansible.builtin.file:
-path: "{{ condor_config_root }}/config.d/{{ schedd_ha_config_file }}"
-state: absent
-notify:
-- Restart HTCondor
-- name: Remove HTCondor SystemD override
-ansible.builtin.file:
-path: /etc/systemd/system/condor.service.d/mount-spool.conf
-state: absent
-notify:
-- Reload SystemD
+- name: Create SystemD override directory for HTCondor
+ansible.builtin.file:
+path: /etc/systemd/system/condor.service.d
+state: directory
+owner: root
+group: root
+mode: 0755
+- name: Ensure HTCondor starts after shared filesystem is mounted
+ansible.builtin.copy:
+dest: /etc/systemd/system/condor.service.d/mount-spool.conf
+mode: 0644
+content: |
+[Unit]
+RequiresMountsFor={{ spool_dir }}
+notify:
+- Reload SystemD
 handlers:
 - name: Reload SystemD
 ansible.builtin.systemd:
 daemon_reload: true
-- name: Restart HTCondor
-ansible.builtin.service:
-name: condor
-state: restarted
 - name: Reload HTCondor
 ansible.builtin.service:
 name: condor
@@ -140,4 +116,4 @@
 changed_when: false
 ansible.builtin.shell: |
 set -e -o pipefail
-wall "******* HTCondor system configuration complete ********"
+wall "******* HTCondor configuration complete; startup-script may still be executing ********"
