Implement recovery of HTCondor spool (job queue) #2500

Merged 4 commits on May 11, 2024
5 changes: 5 additions & 0 deletions community/examples/htc-htcondor.yaml
@@ -20,6 +20,9 @@ vars:
   deployment_name: htcondor-pool
   region: us-central1
   zone: us-central1-c
+  zones:
+  - $(vars.zone)
+  - us-central1-f
   disk_size_gb: 100
   new_image:
     family: htcondor-10x
@@ -129,6 +132,8 @@ deployment_groups:
       default_mig_id: $(htcondor_execute_point.mig_id)
       enable_public_ips: true
       instance_image: $(vars.new_image)
+      enable_high_availability: true
+      spool_disk_size_gb: 200 # required for regional HA
     outputs:
     - access_point_ips
     - access_point_name
4 changes: 2 additions & 2 deletions community/modules/compute/htcondor-execute-point/README.md
@@ -210,8 +210,8 @@ limitations under the License.

| Name | Source | Version |
|------|--------|---------|
-| <a name="module_execute_point_instance_template"></a> [execute\_point\_instance\_template](#module\_execute\_point\_instance\_template) | terraform-google-modules/vm/google//modules/instance_template | ~> 10.1.1 |
-| <a name="module_mig"></a> [mig](#module\_mig) | github.com/terraform-google-modules/terraform-google-vm//modules/mig | aea74d1 |
+| <a name="module_execute_point_instance_template"></a> [execute\_point\_instance\_template](#module\_execute\_point\_instance\_template) | terraform-google-modules/vm/google//modules/instance_template | 10.1.1 |
+| <a name="module_mig"></a> [mig](#module\_mig) | terraform-google-modules/vm/google//modules/mig | 10.1.1 |
| <a name="module_startup_script"></a> [startup\_script](#module\_startup\_script) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script | v1.32.1&depth=1 |

## Resources
6 changes: 4 additions & 2 deletions community/modules/compute/htcondor-execute-point/main.tf
@@ -137,7 +137,7 @@ module "startup_script" {

 module "execute_point_instance_template" {
   source = "terraform-google-modules/vm/google//modules/instance_template"
-  version = "~> 10.1.1"
+  version = "10.1.1"

   name_prefix = local.name_prefix
   project_id = var.project_id
@@ -163,7 +163,9 @@ module "execute_point_instance_template" {
 }

 module "mig" {
-  source = "github.com/terraform-google-modules/terraform-google-vm//modules/mig?ref=aea74d1"
+  source = "terraform-google-modules/vm/google//modules/mig"
+  version = "10.1.1"
+
   project_id = var.project_id
   region = var.region
   distribution_policy_target_shape = var.distribution_policy_target_shape
23 changes: 14 additions & 9 deletions community/modules/scheduler/htcondor-access-point/README.md
@@ -107,29 +107,31 @@ limitations under the License.
|------|---------|
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.1 |
| <a name="requirement_google"></a> [google](#requirement\_google) | >= 3.83 |
-| <a name="requirement_time"></a> [time](#requirement\_time) | ~> 0.9 |
+| <a name="requirement_random"></a> [random](#requirement\_random) | ~> 3.6 |

## Providers

| Name | Version |
|------|---------|
| <a name="provider_google"></a> [google](#provider\_google) | >= 3.83 |
-| <a name="provider_time"></a> [time](#provider\_time) | ~> 0.9 |
+| <a name="provider_random"></a> [random](#provider\_random) | ~> 3.6 |

## Modules

| Name | Source | Version |
|------|--------|---------|
-| <a name="module_access_point_instance_template"></a> [access\_point\_instance\_template](#module\_access\_point\_instance\_template) | github.com/terraform-google-modules/terraform-google-vm//modules/instance_template | 84d7959 |
-| <a name="module_htcondor_ap"></a> [htcondor\_ap](#module\_htcondor\_ap) | github.com/terraform-google-modules/terraform-google-vm//modules/mig | aea74d1 |
+| <a name="module_access_point_instance_template"></a> [access\_point\_instance\_template](#module\_access\_point\_instance\_template) | github.com/terraform-google-modules/terraform-google-vm//modules/instance_template | 73dc845 |
+| <a name="module_htcondor_ap"></a> [htcondor\_ap](#module\_htcondor\_ap) | terraform-google-modules/vm/google//modules/mig | 10.1.1 |
| <a name="module_startup_script"></a> [startup\_script](#module\_startup\_script) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script | v1.32.1&depth=1 |

## Resources

| Name | Type |
|------|------|
+| [google_compute_disk.spool](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_disk) | resource |
+| [google_compute_region_disk.spool](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_region_disk) | resource |
| [google_storage_bucket_object.ap_config](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/storage_bucket_object) | resource |
-| [time_sleep.mig_warmup](https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/sleep) | resource |
+| [random_shuffle.zones](https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/shuffle) | resource |
| [google_compute_image.htcondor](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image) | data source |
| [google_compute_instance.ap](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_instance) | data source |
| [google_compute_region_instance_group.ap](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_region_instance_group) | data source |
@@ -145,16 +147,17 @@ limitations under the License.
| <a name="input_central_manager_ips"></a> [central\_manager\_ips](#input\_central\_manager\_ips) | List of IP addresses of HTCondor Central Managers | `list(string)` | n/a | yes |
| <a name="input_default_mig_id"></a> [default\_mig\_id](#input\_default\_mig\_id) | Default MIG ID for HTCondor jobs; if unset, jobs must specify MIG id | `string` | `""` | no |
| <a name="input_deployment_name"></a> [deployment\_name](#input\_deployment\_name) | HPC Toolkit deployment name. HTCondor cloud resource names will include this value. | `string` | n/a | yes |
-| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | Boot disk size in GB | `number` | `null` | no |
-| <a name="input_distribution_policy_target_shape"></a> [distribution\_policy\_target\_shape](#input\_distribution\_policy\_target\_shape) | Target shape across zones for instance group managing high availability of access point | `string` | `"BALANCED"` | no |
+| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | Boot disk size in GB | `number` | `32` | no |
+| <a name="input_disk_type"></a> [disk\_type](#input\_disk\_type) | Boot disk type | `string` | `"pd-balanced"` | no |
+| <a name="input_distribution_policy_target_shape"></a> [distribution\_policy\_target\_shape](#input\_distribution\_policy\_target\_shape) | Target shape across zones for instance group managing high availability of access point | `string` | `"ANY_SINGLE_ZONE"` | no |
| <a name="input_enable_high_availability"></a> [enable\_high\_availability](#input\_enable\_high\_availability) | Provision HTCondor access point in high availability mode | `bool` | `false` | no |
| <a name="input_enable_oslogin"></a> [enable\_oslogin](#input\_enable\_oslogin) | Enable or Disable OS Login with "ENABLE" or "DISABLE". Set to "INHERIT" to inherit project OS Login setting. | `string` | `"ENABLE"` | no |
| <a name="input_enable_public_ips"></a> [enable\_public\_ips](#input\_enable\_public\_ips) | Enable Public IPs on the access points | `bool` | `false` | no |
| <a name="input_enable_shielded_vm"></a> [enable\_shielded\_vm](#input\_enable\_shielded\_vm) | Enable the Shielded VM configuration (var.shielded\_instance\_config). | `bool` | `false` | no |
| <a name="input_htcondor_bucket_name"></a> [htcondor\_bucket\_name](#input\_htcondor\_bucket\_name) | Name of HTCondor configuration bucket | `string` | n/a | yes |
| <a name="input_instance_image"></a> [instance\_image](#input\_instance\_image) | Custom VM image with HTCondor and Toolkit support installed."<br><br>Expected Fields:<br>name: The name of the image. Mutually exclusive with family.<br>family: The image family to use. Mutually exclusive with name.<br>project: The project where the image is hosted. | `map(string)` | n/a | yes |
| <a name="input_labels"></a> [labels](#input\_labels) | Labels to add to resources. List key, value pairs. | `map(string)` | n/a | yes |
-| <a name="input_machine_type"></a> [machine\_type](#input\_machine\_type) | Machine type to use for HTCondor central managers | `string` | `"c2-standard-4"` | no |
+| <a name="input_machine_type"></a> [machine\_type](#input\_machine\_type) | Machine type to use for HTCondor central managers | `string` | `"n2-standard-4"` | no |
| <a name="input_metadata"></a> [metadata](#input\_metadata) | Metadata to add to HTCondor central managers | `map(string)` | `{}` | no |
| <a name="input_mig_id"></a> [mig\_id](#input\_mig\_id) | List of Managed Instance Group IDs containing execute points in this pool (supplied by htcondor-execute-point module) | `list(string)` | `[]` | no |
| <a name="input_network_self_link"></a> [network\_self\_link](#input\_network\_self\_link) | The self link of the network in which the HTCondor central manager will be created. | `string` | `null` | no |
@@ -163,10 +166,12 @@ limitations under the License.
| <a name="input_region"></a> [region](#input\_region) | Default region for creating resources | `string` | n/a | yes |
| <a name="input_service_account_scopes"></a> [service\_account\_scopes](#input\_service\_account\_scopes) | Scopes by which to limit service account attached to central manager. | `set(string)` | <pre>[<br> "https://www.googleapis.com/auth/cloud-platform"<br>]</pre> | no |
| <a name="input_shielded_instance_config"></a> [shielded\_instance\_config](#input\_shielded\_instance\_config) | Shielded VM configuration for the instance (must set var.enabled\_shielded\_vm) | <pre>object({<br> enable_secure_boot = bool<br> enable_vtpm = bool<br> enable_integrity_monitoring = bool<br> })</pre> | <pre>{<br> "enable_integrity_monitoring": true,<br> "enable_secure_boot": true,<br> "enable_vtpm": true<br>}</pre> | no |
+| <a name="input_spool_disk_size_gb"></a> [spool\_disk\_size\_gb](#input\_spool\_disk\_size\_gb) | Spool disk size in GB | `number` | `32` | no |
+| <a name="input_spool_disk_type"></a> [spool\_disk\_type](#input\_spool\_disk\_type) | Spool disk type | `string` | `"pd-ssd"` | no |
| <a name="input_spool_parent_dir"></a> [spool\_parent\_dir](#input\_spool\_parent\_dir) | HTCondor access point configuration SPOOL will be set to subdirectory named "spool" | `string` | `"/var/lib/condor"` | no |
| <a name="input_subnetwork_self_link"></a> [subnetwork\_self\_link](#input\_subnetwork\_self\_link) | The self link of the subnetwork in which the HTCondor central manager will be created. | `string` | `null` | no |
| <a name="input_update_policy"></a> [update\_policy](#input\_update\_policy) | Replacement policy for Access Point Managed Instance Group ("PROACTIVE" to replace immediately or "OPPORTUNISTIC" to replace upon instance power cycle) | `string` | `"OPPORTUNISTIC"` | no |
-| <a name="input_zones"></a> [zones](#input\_zones) | Zone(s) in which access point may be created. If not supplied, will default to all zones in var.region. | `list(string)` | `[]` | no |
+| <a name="input_zones"></a> [zones](#input\_zones) | Zone(s) in which access point may be created. If not supplied, defaults to 2 randomly-selected zones in var.region. | `list(string)` | `[]` | no |

## Outputs

Original file line number Diff line number Diff line change
@@ -17,11 +17,10 @@
   hosts: localhost
   become: true
   vars:
-    job_queue_ha: false
     spool_dir: /var/lib/condor/spool
     condor_config_root: /etc/condor
     ghpc_config_file: 50-ghpc-managed
-    schedd_ha_config_file: 51-ghpc-schedd-high-availability
+    htcondor_spool_disk_device: /dev/disk/by-id/google-htcondor-spool-disk
   tasks:
   - name: Ensure necessary variables are set
     ansible.builtin.assert:
@@ -61,71 +60,48 @@
   - name: Configure HTCondor SchedD
     when: htcondor_role == 'get_htcondor_submit'
     block:
-    - name: Setup Spool directory
+    - name: Format spool disk
+      community.general.filesystem:
+        fstype: ext4
+        state: present
+        dev: "{{ htcondor_spool_disk_device }}"
+    # RUN TUNE2FS
+    - name: Mount spool (creates mount point)
+      ansible.posix.mount:
+        path: "{{ spool_dir }}"
+        src: "{{ htcondor_spool_disk_device }}"
+        fstype: ext4
+        opts: defaults
+        state: mounted
+    - name: Ensure spool free space
+      ansible.builtin.command: tune2fs -r 0 {{ htcondor_spool_disk_device }}
+    - name: Setup spool directory
       ansible.builtin.file:
         path: "{{ spool_dir }}"
         state: directory
         owner: condor
         group: condor
         mode: 0755
-    - name: Enable SchedD high availability
-      when: job_queue_ha | bool
-      block:
-      - name: Set SchedD HA configuration (requires restart)
-        ansible.builtin.copy:
-          dest: "{{ condor_config_root }}/config.d/{{ schedd_ha_config_file }}"
-          mode: 0644
-          content: |
-            MASTER_HA_LIST=SCHEDD
-            HA_LOCK_URL=file:{{ spool_dir }}
-            VALID_SPOOL_FILES=$(VALID_SPOOL_FILES), SCHEDD.lock
-            HA_POLL_PERIOD=30
-            SCHEDD_NAME=had-schedd@
-        notify:
-        - Restart HTCondor
-      # although HTCondor is guaranteed to start after mounting remote
-      # filesystems is *attempted*, it does not guarantee successful mounts;
-      # this additional SystemD setting will refuse to start HTCondor if the
-      # spool shared filesystem has not been mounted
-      - name: Create SystemD override directory for HTCondor
-        ansible.builtin.file:
-          path: /etc/systemd/system/condor.service.d
-          state: directory
-          owner: root
-          group: root
-          mode: 0755
-      - name: Ensure HTCondor starts after shared filesystem is mounted
-        ansible.builtin.copy:
-          dest: /etc/systemd/system/condor.service.d/mount-spool.conf
-          mode: 0644
-          content: |
-            [Unit]
-            RequiresMountsFor={{ spool_dir }}
-        notify:
-        - Reload SystemD
-    - name: Disable SchedD high availability
-      when: not job_queue_ha | bool
-      block:
-      - name: Remove SchedD HA configuration file
-        ansible.builtin.file:
-          path: "{{ condor_config_root }}/config.d/{{ schedd_ha_config_file }}"
-          state: absent
-        notify:
-        - Restart HTCondor
-      - name: Remove HTCondor SystemD override
-        ansible.builtin.file:
-          path: /etc/systemd/system/condor.service.d/mount-spool.conf
-          state: absent
-        notify:
-        - Reload SystemD
+    - name: Create SystemD override directory for HTCondor
+      ansible.builtin.file:
+        path: /etc/systemd/system/condor.service.d
+        state: directory
+        owner: root
+        group: root
+        mode: 0755
+    - name: Ensure HTCondor starts after shared filesystem is mounted
+      ansible.builtin.copy:
+        dest: /etc/systemd/system/condor.service.d/mount-spool.conf
+        mode: 0644
+        content: |
+          [Unit]
+          RequiresMountsFor={{ spool_dir }}
+      notify:
+      - Reload SystemD
   handlers:
   - name: Reload SystemD
     ansible.builtin.systemd:
       daemon_reload: true
-  - name: Restart HTCondor
-    ansible.builtin.service:
-      name: condor
-      state: restarted
   - name: Reload HTCondor
     ansible.builtin.service:
       name: condor
@@ -140,4 +116,4 @@
     changed_when: false
     ansible.builtin.shell: |
       set -e -o pipefail
-      wall "******* HTCondor system configuration complete ********"
+      wall "******* HTCondor configuration complete; startup-script may still be executing ********"
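
The access point README in this change lists `random_shuffle.zones` and `google_compute_region_disk.spool` as new resources and says `zones` defaults to 2 randomly-selected zones in `var.region`. The module's `main.tf` for the access point is not part of this excerpt, so the fragment below is only a sketch of how those pieces might fit together, not the PR's actual code. The data source name `available`, the local `zones`, the disk name, and the variable declarations are assumptions made for illustration.

```hcl
# Illustrative fragment only; not taken from the PR.
variable "project_id" {
  type = string
}

variable "region" {
  type = string
}

variable "zones" {
  type    = list(string)
  default = []
}

variable "spool_disk_size_gb" {
  type    = number
  default = 32
}

# Assumed data source: every zone currently available in the region.
data "google_compute_zones" "available" {
  project = var.project_id
  region  = var.region
}

# Pick two zones at random from the region.
resource "random_shuffle" "zones" {
  input        = data.google_compute_zones.available.names
  result_count = 2
}

locals {
  # Prefer caller-supplied zones; otherwise use the shuffled pair.
  zones = length(var.zones) > 0 ? var.zones : random_shuffle.zones.result
}

# A regional persistent disk is replicated across exactly two zones, so this
# sketch only works if local.zones has two entries.
resource "google_compute_region_disk" "spool" {
  name          = "example-spool-disk" # hypothetical name
  project       = var.project_id
  region        = var.region
  type          = "pd-ssd"
  size          = var.spool_disk_size_gb
  replica_zones = local.zones
}
```

Because a regional disk needs exactly two replica zones, a default of two shuffled zones (rather than all zones in the region) lines up with the regional spool disk. A caller who sets `zones` explicitly, as the example blueprint does, would need to supply two zones in the same region for this sketch to apply cleanly.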