Implement recovery of HTCondor spool (job queue)
Implement high availability of the HTCondor job queue by using a managed
instance group to provision a single VM with stateful resources:

- a public or private IP address
- a disk containing the spool (job queue and related data)

In a single zone, "HA" means resilience against failure of the access
point VM. Across two zones, it additionally means resilience against
failure of the access point's zone; in that configuration the spool disk
is replicated synchronously between the zones. (A minimal Terraform
sketch of this shape follows the task list below.)

[!IMPORTANT] WIP. Remaining tasks:

- consider renaming enable_high_availability to clarify that it enables
  additional reliability against zonal failures
- document and validate restrictions of regional disks (supported VM
  families, reduced disk performance)
- enforce a limit of 1 or 2 zones (a regional disk limitation)
- likely switch to the ANY_SINGLE_ZONE target shape
- when using 2 zones, identify a solution for aligning zones between
  HTCondor pool components (probably a future PR)
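
For orientation, the pattern this commit implements reduces to roughly the
following Terraform sketch. This is a minimal, hypothetical rendering: names
and values are illustrative, condensed from the diff below, and the real
module builds the MIG through terraform-google-modules/vm rather than raw
resources.

```hcl
# A regional persistent disk holds the spool and replicates it
# synchronously across the two chosen zones.
resource "google_compute_region_disk" "spool" {
  name          = "htcondor-pool-spool-disk" # illustrative
  type          = "pd-ssd"
  region        = "us-central1"
  size          = 200 # regional spool disks must be >= 200 GB (validated below)
  replica_zones = ["us-central1-c", "us-central1-f"]
}

# A regional MIG keeps exactly one access point VM alive. Declaring the
# disk and IP "stateful" makes a replacement VM reattach the same spool
# disk and keep the same address.
resource "google_compute_region_instance_group_manager" "ap" {
  name                      = "htcondor-pool-ap" # illustrative
  base_instance_name        = "htcondor-pool-ap"
  region                    = "us-central1"
  target_size               = 1
  distribution_policy_zones = ["us-central1-c", "us-central1-f"]

  version {
    # assumed: a template that attaches the disk above with this device_name
    instance_template = google_compute_instance_template.ap.self_link
  }

  stateful_disk {
    device_name = "htcondor-spool-disk"
    delete_rule = "ON_PERMANENT_INSTANCE_DELETION"
  }

  stateful_internal_ip {
    interface_name = "nic0"
    delete_rule    = "ON_PERMANENT_INSTANCE_DELETION"
  }
}
```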
tpdownes committed Apr 19, 2024
1 parent bbda553 commit 818119c
Showing 6 changed files with 142 additions and 73 deletions.
5 changes: 5 additions & 0 deletions community/examples/htc-htcondor.yaml
@@ -20,6 +20,9 @@ vars:
deployment_name: htcondor-pool
region: us-central1
zone: us-central1-c
zones:
- $(vars.zone)
- us-central1-f
disk_size_gb: 100
new_image:
family: htcondor-10x
@@ -129,6 +132,8 @@ deployment_groups:
default_mig_id: $(htcondor_execute_point.mig_id)
enable_public_ips: true
instance_image: $(vars.new_image)
enable_high_availability: true
spool_disk_size_gb: 200 # required for regional HA
outputs:
- access_point_ips
- access_point_name
18 changes: 13 additions & 5 deletions community/modules/scheduler/htcondor-access-point/README.md
@@ -107,28 +107,33 @@ limitations under the License.
|------|---------|
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.1 |
| <a name="requirement_google"></a> [google](#requirement\_google) | >= 3.83 |
| <a name="requirement_random"></a> [random](#requirement\_random) | ~> 3.6 |
| <a name="requirement_time"></a> [time](#requirement\_time) | ~> 0.9 |

## Providers

| Name | Version |
|------|---------|
| <a name="provider_google"></a> [google](#provider\_google) | >= 3.83 |
| <a name="provider_random"></a> [random](#provider\_random) | ~> 3.6 |
| <a name="provider_time"></a> [time](#provider\_time) | ~> 0.9 |

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_access_point_instance_template"></a> [access\_point\_instance\_template](#module\_access\_point\_instance\_template) | github.com/terraform-google-modules/terraform-google-vm//modules/instance_template | 84d7959 |
| <a name="module_htcondor_ap"></a> [htcondor\_ap](#module\_htcondor\_ap) | github.com/terraform-google-modules/terraform-google-vm//modules/mig | aea74d1 |
| <a name="module_access_point_instance_template"></a> [access\_point\_instance\_template](#module\_access\_point\_instance\_template) | github.com/tpdownes/terraform-google-vm//modules/instance_template | fix_template_source_v10&depth=1 |
| <a name="module_htcondor_ap"></a> [htcondor\_ap](#module\_htcondor\_ap) | terraform-google-modules/vm/google//modules/mig | 10.1.1 |
| <a name="module_startup_script"></a> [startup\_script](#module\_startup\_script) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script | v1.31.1&depth=1 |

## Resources

| Name | Type |
|------|------|
| [google_compute_disk.spool](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_disk) | resource |
| [google_compute_region_disk.spool](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_region_disk) | resource |
| [google_storage_bucket_object.ap_config](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/storage_bucket_object) | resource |
| [random_shuffle.zones](https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/shuffle) | resource |
| [time_sleep.mig_warmup](https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/sleep) | resource |
| [google_compute_image.htcondor](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image) | data source |
| [google_compute_instance.ap](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_instance) | data source |
@@ -145,7 +150,8 @@ limitations under the License.
| <a name="input_central_manager_ips"></a> [central\_manager\_ips](#input\_central\_manager\_ips) | List of IP addresses of HTCondor Central Managers | `list(string)` | n/a | yes |
| <a name="input_default_mig_id"></a> [default\_mig\_id](#input\_default\_mig\_id) | Default MIG ID for HTCondor jobs; if unset, jobs must specify MIG id | `string` | `""` | no |
| <a name="input_deployment_name"></a> [deployment\_name](#input\_deployment\_name) | HPC Toolkit deployment name. HTCondor cloud resource names will include this value. | `string` | n/a | yes |
| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | Boot disk size in GB | `number` | `null` | no |
| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | Boot disk size in GB | `number` | `32` | no |
| <a name="input_disk_type"></a> [disk\_type](#input\_disk\_type) | Boot disk size in GB | `string` | `"pd-balanced"` | no |
| <a name="input_distribution_policy_target_shape"></a> [distribution\_policy\_target\_shape](#input\_distribution\_policy\_target\_shape) | Target shape acoss zones for instance group managing high availability of access point | `string` | `"BALANCED"` | no |
| <a name="input_enable_high_availability"></a> [enable\_high\_availability](#input\_enable\_high\_availability) | Provision HTCondor access point in high availability mode | `bool` | `false` | no |
| <a name="input_enable_oslogin"></a> [enable\_oslogin](#input\_enable\_oslogin) | Enable or Disable OS Login with "ENABLE" or "DISABLE". Set to "INHERIT" to inherit project OS Login setting. | `string` | `"ENABLE"` | no |
@@ -154,7 +160,7 @@ limitations under the License.
| <a name="input_htcondor_bucket_name"></a> [htcondor\_bucket\_name](#input\_htcondor\_bucket\_name) | Name of HTCondor configuration bucket | `string` | n/a | yes |
| <a name="input_instance_image"></a> [instance\_image](#input\_instance\_image) | Custom VM image with HTCondor and Toolkit support installed."<br><br>Expected Fields:<br>name: The name of the image. Mutually exclusive with family.<br>family: The image family to use. Mutually exclusive with name.<br>project: The project where the image is hosted. | `map(string)` | n/a | yes |
| <a name="input_labels"></a> [labels](#input\_labels) | Labels to add to resources. List key, value pairs. | `map(string)` | n/a | yes |
| <a name="input_machine_type"></a> [machine\_type](#input\_machine\_type) | Machine type to use for HTCondor central managers | `string` | `"c2-standard-4"` | no |
| <a name="input_machine_type"></a> [machine\_type](#input\_machine\_type) | Machine type to use for HTCondor central managers | `string` | `"n2-standard-4"` | no |
| <a name="input_metadata"></a> [metadata](#input\_metadata) | Metadata to add to HTCondor central managers | `map(string)` | `{}` | no |
| <a name="input_mig_id"></a> [mig\_id](#input\_mig\_id) | List of Managed Instance Group IDs containing execute points in this pool (supplied by htcondor-execute-point module) | `list(string)` | `[]` | no |
| <a name="input_network_self_link"></a> [network\_self\_link](#input\_network\_self\_link) | The self link of the network in which the HTCondor central manager will be created. | `string` | `null` | no |
@@ -163,10 +169,12 @@ limitations under the License.
| <a name="input_region"></a> [region](#input\_region) | Default region for creating resources | `string` | n/a | yes |
| <a name="input_service_account_scopes"></a> [service\_account\_scopes](#input\_service\_account\_scopes) | Scopes by which to limit service account attached to central manager. | `set(string)` | <pre>[<br> "https://www.googleapis.com/auth/cloud-platform"<br>]</pre> | no |
| <a name="input_shielded_instance_config"></a> [shielded\_instance\_config](#input\_shielded\_instance\_config) | Shielded VM configuration for the instance (must set var.enabled\_shielded\_vm) | <pre>object({<br> enable_secure_boot = bool<br> enable_vtpm = bool<br> enable_integrity_monitoring = bool<br> })</pre> | <pre>{<br> "enable_integrity_monitoring": true,<br> "enable_secure_boot": true,<br> "enable_vtpm": true<br>}</pre> | no |
| <a name="input_spool_disk_size_gb"></a> [spool\_disk\_size\_gb](#input\_spool\_disk\_size\_gb) | Boot disk size in GB | `number` | `32` | no |
| <a name="input_spool_disk_type"></a> [spool\_disk\_type](#input\_spool\_disk\_type) | Boot disk size in GB | `string` | `"pd-ssd"` | no |
| <a name="input_spool_parent_dir"></a> [spool\_parent\_dir](#input\_spool\_parent\_dir) | HTCondor access point configuration SPOOL will be set to subdirectory named "spool" | `string` | `"/var/lib/condor"` | no |
| <a name="input_subnetwork_self_link"></a> [subnetwork\_self\_link](#input\_subnetwork\_self\_link) | The self link of the subnetwork in which the HTCondor central manager will be created. | `string` | `null` | no |
| <a name="input_update_policy"></a> [update\_policy](#input\_update\_policy) | Replacement policy for Access Point Managed Instance Group ("PROACTIVE" to replace immediately or "OPPORTUNISTIC" to replace upon instance power cycle) | `string` | `"OPPORTUNISTIC"` | no |
| <a name="input_zones"></a> [zones](#input\_zones) | Zone(s) in which access point may be created. If not supplied, will default to all zones in var.region. | `list(string)` | `[]` | no |
| <a name="input_zones"></a> [zones](#input\_zones) | Zone(s) in which access point may be created. If not supplied, defaults to 2 randomly-selected zones in var.region. | `list(string)` | `[]` | no |

## Outputs

@@ -17,11 +17,10 @@
hosts: localhost
become: true
vars:
job_queue_ha: false
spool_dir: /var/lib/condor/spool
condor_config_root: /etc/condor
ghpc_config_file: 50-ghpc-managed
schedd_ha_config_file: 51-ghpc-schedd-high-availability
htcondor_spool_disk_device: /dev/disk/by-id/google-htcondor-spool-disk
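# assumption, consistent with main.tf below: Google's guest environment
# exposes an attached disk at /dev/disk/by-id/google-<device_name>, and the
# instance template attaches the spool disk with device_name
# "htcondor-spool-disk"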
tasks:
- name: Ensure necessary variables are set
ansible.builtin.assert:
@@ -61,71 +60,48 @@
- name: Configure HTCondor SchedD
when: htcondor_role == 'get_htcondor_submit'
block:
- name: Setup Spool directory
- name: Format spool disk
community.general.filesystem:
fstype: ext4
state: present
dev: "{{ htcondor_spool_disk_device }}"
# the tune2fs task below sets reserved blocks to 0 so condor can use the full disk
- name: Mount spool (creates mount point)
ansible.posix.mount:
path: "{{ spool_dir }}"
src: "{{ htcondor_spool_disk_device }}"
fstype: ext4
opts: defaults
state: mounted
- name: Ensure spool free space
ansible.builtin.command: tune2fs -r 0 {{ htcondor_spool_disk_device }}
- name: Setup spool directory
ansible.builtin.file:
path: "{{ spool_dir }}"
state: directory
owner: condor
group: condor
mode: 0755
- name: Enable SchedD high availability
when: job_queue_ha | bool
block:
- name: Set SchedD HA configuration (requires restart)
ansible.builtin.copy:
dest: "{{ condor_config_root }}/config.d/{{ schedd_ha_config_file }}"
mode: 0644
content: |
MASTER_HA_LIST=SCHEDD
HA_LOCK_URL=file:{{ spool_dir }}
VALID_SPOOL_FILES=$(VALID_SPOOL_FILES), SCHEDD.lock
HA_POLL_PERIOD=30
SCHEDD_NAME=had-schedd@
notify:
- Restart HTCondor
# although HTCondor is guaranteed to start only after mounting of remote
# filesystems has been *attempted*, SystemD does not guarantee that the
# mounts succeeded; this additional SystemD setting refuses to start
# HTCondor if the spool filesystem has not been mounted
- name: Create SystemD override directory for HTCondor
ansible.builtin.file:
path: /etc/systemd/system/condor.service.d
state: directory
owner: root
group: root
mode: 0755
- name: Ensure HTCondor starts after shared filesystem is mounted
ansible.builtin.copy:
dest: /etc/systemd/system/condor.service.d/mount-spool.conf
mode: 0644
content: |
[Unit]
RequiresMountsFor={{ spool_dir }}
notify:
- Reload SystemD
- name: Disable SchedD high availability
when: not job_queue_ha | bool
block:
- name: Remove SchedD HA configuration file
ansible.builtin.file:
path: "{{ condor_config_root }}/config.d/{{ schedd_ha_config_file }}"
state: absent
notify:
- Restart HTCondor
- name: Remove HTCondor SystemD override
ansible.builtin.file:
path: /etc/systemd/system/condor.service.d/mount-spool.conf
state: absent
notify:
- Reload SystemD
- name: Create SystemD override directory for HTCondor
ansible.builtin.file:
path: /etc/systemd/system/condor.service.d
state: directory
owner: root
group: root
mode: 0755
- name: Ensure HTCondor starts after shared filesystem is mounted
ansible.builtin.copy:
dest: /etc/systemd/system/condor.service.d/mount-spool.conf
mode: 0644
content: |
[Unit]
RequiresMountsFor={{ spool_dir }}
notify:
- Reload SystemD
handlers:
- name: Reload SystemD
ansible.builtin.systemd:
daemon_reload: true
- name: Restart HTCondor
ansible.builtin.service:
name: condor
state: restarted
- name: Reload HTCondor
ansible.builtin.service:
name: condor
@@ -140,4 +116,4 @@
changed_when: false
ansible.builtin.shell: |
set -e -o pipefail
wall "******* HTCondor system configuration complete ********"
wall "******* HTCondor configuration complete; startup-script may still be executing ********"
63 changes: 56 additions & 7 deletions community/modules/scheduler/htcondor-access-point/main.tf
@@ -28,7 +28,7 @@ locals {
enable_oslogin_metadata = var.enable_oslogin == "INHERIT" ? {} : { enable-oslogin = lookup(local.oslogin_api_values, var.enable_oslogin, "") }
metadata = merge(local.network_storage_metadata, local.enable_oslogin_metadata, var.metadata)

host_count = var.enable_high_availability ? 2 : 1
host_count = 1
name_prefix = "${var.deployment_name}-ap"

example_runner = {
@@ -86,15 +86,19 @@ locals {
args = join(" ", [
"-e htcondor_role=get_htcondor_submit",
"-e config_object=${local.ap_object}",
"-e job_queue_ha=${var.enable_high_availability}",
"-e spool_dir=${var.spool_parent_dir}/spool",
"-e htcondor_spool_disk_device=/dev/disk/by-id/google-${local.spool_disk_device_name}",
])
}

access_point_ips = [data.google_compute_instance.ap.network_interface[0].network_ip]
access_point_name = data.google_compute_instance.ap.name

zones = coalescelist(var.zones, data.google_compute_zones.available.names)
spool_disk_resource_name = "${var.deployment_name}-spool-disk"
spool_disk_device_name = "htcondor-spool-disk"
spool_disk_source = try(google_compute_disk.spool[0].name, google_compute_region_disk.spool[0].self_link)

zones = coalescelist(var.zones, random_shuffle.zones.result)
}

data "google_compute_image" "htcondor" {
@@ -115,6 +119,11 @@ data "google_compute_zones" "available" {
region = var.region
}

resource "random_shuffle" "zones" {
input = data.google_compute_zones.available.names
result_count = var.enable_high_availability ? 2 : 1
}

data "google_compute_region_instance_group" "ap" {
self_link = time_sleep.mig_warmup.triggers.self_link
lifecycle {
@@ -153,9 +162,35 @@ module "startup_script" {
runners = local.all_runners
}

resource "google_compute_region_disk" "spool" {
count = var.enable_high_availability ? 1 : 0
name = local.spool_disk_resource_name
labels = local.labels
type = var.spool_disk_type
region = var.region
size = var.spool_disk_size_gb

replica_zones = local.zones

lifecycle {
precondition {
condition = var.spool_disk_size_gb >= 200
error_message = "When using HTCondor access point high availability, var.spool_disk_size_gb must be set to 200 or greater."
}
}
}

resource "google_compute_disk" "spool" {
count = var.enable_high_availability ? 0 : 1
name = local.spool_disk_resource_name
labels = local.labels
type = var.spool_disk_type
zone = local.zones[0]
size = var.spool_disk_size_gb
}

module "access_point_instance_template" {
# tflint-ignore: terraform_module_pinned_source
source = "github.com/terraform-google-modules/terraform-google-vm//modules/instance_template?ref=84d7959"
source = "github.com/tpdownes/terraform-google-vm//modules/instance_template?ref=fix_template_source_v10&depth=1"

name_prefix = local.name_prefix
project_id = var.project_id
@@ -169,6 +204,7 @@ module "access_point_instance_template" {

machine_type = var.machine_type
disk_size_gb = var.disk_size_gb
disk_type = var.disk_type
preemptible = false
startup_script = module.startup_script.startup_script
metadata = local.metadata
@@ -177,11 +213,19 @@
# secure boot
enable_shielded_vm = var.enable_shielded_vm
shielded_instance_config = var.shielded_instance_config

# spool disk
additional_disks = [
{
source = local.spool_disk_source
device_name = local.spool_disk_device_name
}
]
}

module "htcondor_ap" {
# tflint-ignore: terraform_module_pinned_source
source = "github.com/terraform-google-modules/terraform-google-vm//modules/mig?ref=aea74d1"
source = "terraform-google-modules/vm/google//modules/mig"
version = "10.1.1"

project_id = var.project_id
region = var.region
@@ -220,6 +264,11 @@ module "htcondor_ap" {
type = var.update_policy
}]

stateful_disks = [{
device_name = local.spool_disk_device_name
delete_rule = "ON_PERMANENT_INSTANCE_DELETION"
}]

stateful_ips = [{
interface_name = "nic0"
delete_rule = "ON_PERMANENT_INSTANCE_DELETION"
33 changes: 30 additions & 3 deletions community/modules/scheduler/htcondor-access-point/variables.tf
@@ -35,10 +35,15 @@ variable "region" {
}

variable "zones" {
description = "Zone(s) in which access point may be created. If not supplied, will default to all zones in var.region."
description = "Zone(s) in which access point may be created. If not supplied, defaults to 2 randomly-selected zones in var.region."
type = list(string)
default = []
nullable = false

validation {
condition = length(var.zones) == 0 || length(var.zones) == 2
error_message = "Set var.zones to the empty list or 2 zones in var.region"
}
}

variable "distribution_policy_target_shape" {
@@ -83,7 +88,29 @@ variable "network_storage" {
variable "disk_size_gb" {
description = "Boot disk size in GB"
type = number
default = null
default = 32
nullable = false
}

variable "disk_type" {
description = "Boot disk size in GB"
type = string
default = "pd-balanced"
nullable = false
}

variable "spool_disk_size_gb" {
description = "Boot disk size in GB"
type = number
default = 32
nullable = false
}

variable "spool_disk_type" {
description = "Boot disk size in GB"
type = string
default = "pd-ssd"
nullable = false
}

variable "metadata" {
Expand Down Expand Up @@ -140,7 +167,7 @@ variable "instance_image" {
variable "machine_type" {
description = "Machine type to use for HTCondor central managers"
type = string
default = "c2-standard-4"
default = "n2-standard-4"
}

variable "access_point_runner" {
