Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core][AWS] Allow specification of Security Groups for resources. #3501

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

JGSweets
Copy link
Contributor

@JGSweets JGSweets commented May 1, 2024

Similar to #3488 but for SGs
This PR:

  • Allows specification of SGs for controllers vs workers.
  • Currently removes the port requirement to be able to allow SG specification in serve.

Addresses: #3473

In ~/.sky/config.yaml:

  • When setting for a controller wildcard and all other resources ("*")
aws:
  security_group_name:
    - sky-serve-controller-*: skypilot-controller-fake-sg
    - "*": skypilot-default-fake-sg 
  • When setting for everything by specifying a default ("*")
aws:
  security_group_name:
    - "*": skypilot-default-fake-sg
  • When setting for everything by specifying just the security_group_name via string
aws:
  security_group_name: skypilot-default-fake-sg

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • tested running in all 3 configs in AWS
      • Nothing set
      • setting different controller and default values
      • setting default only
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

EDITED:

  • refactored to match the IAM format.

@JGSweets JGSweets force-pushed the allow-security-group-spec branch from ae9633d to e644f58 Compare May 7, 2024 17:12
Copy link
Contributor Author

@JGSweets JGSweets left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feedback for resolving the following would be appreciated:

  • ports / security_group_name for serve
  • cloud resources only receive cluster_name_on_cloud for make_deploy_variables instead of cluster_name

@@ -920,12 +920,6 @@ def _try_validate_ports(self) -> None:
"""
if self.ports is None:
return
if skypilot_config.get_nested(('aws', 'security_group_name'),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how best to resolve this. For Serve, ports are automatically specified for the load balancer which conflicts with being able to set SGs for it.
#3473

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a customized security group, it is possible that the security group does not the required ports opened and the user does not have enough permission to modify the security group to add those rules.
Also, checking the ports that are already open for a security group is a bit tricky, as it is possible that some of the rules open up some ports but only expose them to a specific range of IP.

existing_ports: Set[int] = set()
for existing_rule in sg.ip_permissions:
# Skip any non-tcp rules.
if existing_rule['IpProtocol'] != 'tcp':
continue
# Skip any rules that don't have a FromPort or ToPort.
if 'FromPort' not in existing_rule or 'ToPort' not in existing_rule:
continue
existing_ports.update(
range(existing_rule['FromPort'], existing_rule['ToPort'] + 1))
ports_to_open = resources_utils.port_set_to_ranges(
resources_utils.port_ranges_to_set(ports) - existing_ports)

Tagging @cblmemo for anything to chime in here.

In the long run, we may have to handle those, but it might be fine for now to print a WARNING / hint here to let a user know that the security group is specified and it is not guaranteed that the ports will be correctly opened on AWS.

if self.cloud is None or instance(self.cloud, clouds.AWS):
    security_group_name = skypilot_config.get_nested(('aws', 'security_group_name'), None)
    if security_group_name is not None:
            with ux_utils.print_exception_no_traceback():
                logger.warning(
                    f'Ports {self.ports} and security group name are specified: {security_group_name}. '
                    'It is not guaranteed that the ports will be opened if the specified security group is not correctly set up. '
                    'Please try to specify `ports` only and leave out `aws.security_group_name` in `~/.sky/config.yaml` to allow SkyPilot to automatically create and configure the security group.')

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning looks good to me. Random idea: could also consider remove the port specification for load balancer if we use security groups as resoruces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the suggested warning.

@@ -567,7 +569,7 @@ def get_config_schema():
# Validation may fail if $schema is included.
if k != '$schema'
}
resources_schema['properties'].pop('ports')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how best to resolve this. For Serve, ports are automatically specified for the load balancer which conflicts with being able to set SGs for it.
#3473

@@ -102,11 +102,27 @@ Available fields and semantics:

# Security group (optional).
#
# The name of the security group to use for all instances. If not specified,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to match the format of remote identity.

if user_security_group is not None and not isinstance(
user_security_group, str):
for profile in user_security_group:
if fnmatch.fnmatchcase(cluster_name_on_cloud,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function does not currently have access to cluster_name

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch! A proposal would be we refactor the make_deploy_resources_variable to take ClusterName dataclass in all cloud classes, instead of a single string for the cluster name:

@dataclasses.dataclass
class ClusterName:
display_name: str
name_on_cloud: str
def __repr__(self) -> str:
return repr(self.display_name)
def __str__(self) -> str:
return self.display_name

To do so, we can move the ClusterName class out of provision module to utils.resources_utils, and let the cloud class and the functions under provision module to use ClusterName object as input.

@JGSweets
Copy link
Contributor Author

JGSweets commented May 8, 2024

@Michaelvll @concretevitamin do you have any suggestions for the concerns I raised? Thanks!

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the support for specifying security group names for different cluster @JGSweets! This is awesome. I left some comments, mainly concerning that specifying security group name will cause open ports not reliable if the user does not have the security group set up correctly. For now, a good warning or hint message can be enough.

if user_security_group is not None and not isinstance(
user_security_group, str):
for profile in user_security_group:
if fnmatch.fnmatchcase(cluster_name_on_cloud,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch! A proposal would be we refactor the make_deploy_resources_variable to take ClusterName dataclass in all cloud classes, instead of a single string for the cluster name:

@dataclasses.dataclass
class ClusterName:
display_name: str
name_on_cloud: str
def __repr__(self) -> str:
return repr(self.display_name)
def __str__(self) -> str:
return self.display_name

To do so, we can move the ClusterName class out of provision module to utils.resources_utils, and let the cloud class and the functions under provision module to use ClusterName object as input.

@@ -567,7 +569,7 @@ def get_config_schema():
# Validation may fail if $schema is included.
if k != '$schema'
}
resources_schema['properties'].pop('ports')
resources_schema['properties'].pop('port', None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ports should be the correct field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ports is correct as it pops this from _get_single_resources_schema`

'ports': {

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I mean, it seems we changed it to port now, which seems not a valid property to pop? It seems the issue is fixed in the latest commit : )

sky/utils/schemas.py Show resolved Hide resolved
Comment on lines 599 to 616
{
# A list of single-element dict to pretain the
# order.
# Example:
# security_group_name:
# - my-cluster1-*: my-security-group-1
# - my-cluster2-*: my-security-group-2
# - "*"": my-security-group-3
'type': 'array',
'items': {
'type': 'object',
'additionalProperties': {
'type': 'string'
},
'maxProperties': 1,
'minProperties': 1,
},
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: Since this pattern has been used by the remote_identity and security_group_name, we can probably have a constant or a function to generate this dict.

@@ -920,12 +920,6 @@ def _try_validate_ports(self) -> None:
"""
if self.ports is None:
return
if skypilot_config.get_nested(('aws', 'security_group_name'),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a customized security group, it is possible that the security group does not the required ports opened and the user does not have enough permission to modify the security group to add those rules.
Also, checking the ports that are already open for a security group is a bit tricky, as it is possible that some of the rules open up some ports but only expose them to a specific range of IP.

existing_ports: Set[int] = set()
for existing_rule in sg.ip_permissions:
# Skip any non-tcp rules.
if existing_rule['IpProtocol'] != 'tcp':
continue
# Skip any rules that don't have a FromPort or ToPort.
if 'FromPort' not in existing_rule or 'ToPort' not in existing_rule:
continue
existing_ports.update(
range(existing_rule['FromPort'], existing_rule['ToPort'] + 1))
ports_to_open = resources_utils.port_set_to_ranges(
resources_utils.port_ranges_to_set(ports) - existing_ports)

Tagging @cblmemo for anything to chime in here.

In the long run, we may have to handle those, but it might be fine for now to print a WARNING / hint here to let a user know that the security group is specified and it is not guaranteed that the ports will be correctly opened on AWS.

if self.cloud is None or instance(self.cloud, clouds.AWS):
    security_group_name = skypilot_config.get_nested(('aws', 'security_group_name'), None)
    if security_group_name is not None:
            with ux_utils.print_exception_no_traceback():
                logger.warning(
                    f'Ports {self.ports} and security group name are specified: {security_group_name}. '
                    'It is not guaranteed that the ports will be opened if the specified security group is not correctly set up. '
                    'Please try to specify `ports` only and leave out `aws.security_group_name` in `~/.sky/config.yaml` to allow SkyPilot to automatically create and configure the security group.')

@@ -154,7 +154,11 @@
# we need to take this field from the new yaml.
('provider', 'tpu_node'),
('provider', 'security_group', 'GroupName'),
('available_node_types', 'ray.head.default', 'node_config',
Copy link
Contributor Author

@JGSweets JGSweets May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixes an error with IAM roles if the cluster previously existed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we want to always update the IamInstanceProfile for an old cluster when a user update the ~/.sky/config.yaml? Does this work with launching an existing cluster with a different remote_identity? It would be nice to include some tests in the PR description : )

Comment on lines +926 to +936
if security_group_name is not None:
with ux_utils.print_exception_no_traceback():
logger.warning(
f'Ports {self.ports} and security group name are '
f'specified: {security_group_name}. It is not '
'guaranteed that the ports will be opened if the '
'specified security group is not correctly set up. '
'Please try to specify `ports` only and leave out '
'`aws.security_group_name` in `~/.sky/config.yaml` to '
'allow SkyPilot to automatically create and configure '
'the security group.')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is the ideal place as it seems to print the warning 10+ times per launch.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch! Let's remove the warning here and move it to the aws.make_deploy_resources_variables so that the warning will only shown when launching an exact launchable resource. (check out the comment in aws.py above)

@JGSweets JGSweets marked this pull request as ready for review May 13, 2024 01:52
@JGSweets
Copy link
Contributor Author

@Michaelvll @cblmemo I've updated the PR with the ClusterName refactor.

Do we want to further refactor provisioners? If so, should we do it in this PR or a subsequent PR?

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @JGSweets and sorry for the delay! This PR and the ClusterName refactoring looks mostly good to me. Left several additional comments. I don't think we need further refactoring for provisioners at the moment.

@@ -567,7 +569,7 @@ def get_config_schema():
# Validation may fail if $schema is included.
if k != '$schema'
}
resources_schema['properties'].pop('ports')
resources_schema['properties'].pop('port', None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I mean, it seems we changed it to port now, which seems not a valid property to pop? It seems the issue is fixed in the latest commit : )

@@ -154,7 +154,11 @@
# we need to take this field from the new yaml.
('provider', 'tpu_node'),
('provider', 'security_group', 'GroupName'),
('available_node_types', 'ray.head.default', 'node_config',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we want to always update the IamInstanceProfile for an old cluster when a user update the ~/.sky/config.yaml? Does this work with launching an existing cluster with a different remote_identity? It would be nice to include some tests in the PR description : )

Comment on lines +415 to 420
if user_security_group is None and resources.ports is not None:
# Already checked in Resources._try_validate_ports
assert user_security_group is None
security_group = USER_PORTS_SECURITY_GROUP_NAME.format(
cluster_name_on_cloud)
elif user_security_group is not None:
assert resources.ports is None
security_group = user_security_group
else:
cluster_name.name_on_cloud)
elif user_security_group is None:
security_group = DEFAULT_SECURITY_GROUP_NAME
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move the warning from aws.py to here so that it will only be printed once whenever it is trying to launch an AWS cluster with the security group specified.

        security_group = user_security_group
        if security_group is None:
            if resources.ports is not None:
                # Already checked in Resources._try_validate_ports
                security_group = USER_PORTS_SECURITY_GROUP_NAME.format(
                    cluster_name.name_on_cloud)
            elif user_security_group is None:
                security_group = DEFAULT_SECURITY_GROUP_NAME
        elif resources.ports is not None:
            # resources.ports will only take effect when the security group is not specified
            # by user.
            logger.warning(
                f'{colorama.Fore.YELLOW}AWS security group name is specified '
                f'({security_group}) with aws.security_group_name in ~/.sky/config.yaml, '
                f'while ports ({resources.ports}) is also specified in task resources. The '
                'ports are not guaranteed to be opened in the specified security group, '
                'due to the complex rules setup. Please manually make sure it.'
                f'{colorama.Style.RESET_ALL}')

Unfortunately, since we call make_deploy_resources_variables in the following place, the logging will still twice for each failover. The resources_vars retrieved is just for getting the custom_resources later, so we can add a custom_resources field in the provision_record returned by the bulk_provision.

resources_vars = (
to_provision.cloud.make_deploy_resources_variables(
to_provision,
resources_utils.ClusterName(
cluster_name, handle.cluster_name_on_cloud),
region, zones))

provision_record = provisioner.bulk_provision(
to_provision.cloud,
region,
zones,
resources_utils.ClusterName(
cluster_name, handle.cluster_name_on_cloud),
num_nodes=num_nodes,
cluster_yaml=handle.cluster_yaml,
prev_cluster_ever_up=prev_cluster_ever_up,
log_dir=self.log_dir)

I am ok to leave the change for bulk_provision and provision_record to a future PR if we think the current PR involves too many things.

raise ValueError(
'Cannot specify ports when AWS security group name is '
'specified.')
if self.cloud is None or isinstance(self.cloud, clouds.AWS):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified to:

Suggested change
if self.cloud is None or isinstance(self.cloud, clouds.AWS):
if isinstance(self.cloud, clouds.AWS):

Comment on lines +927 to +936
with ux_utils.print_exception_no_traceback():
logger.warning(
f'Ports {self.ports} and security group name are '
f'specified: {security_group_name}. It is not '
'guaranteed that the ports will be opened if the '
'specified security group is not correctly set up. '
'Please try to specify `ports` only and leave out '
'`aws.security_group_name` in `~/.sky/config.yaml` to '
'allow SkyPilot to automatically create and configure '
'the security group.')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with ux_utils.print_exception_no_traceback():
logger.warning(
f'Ports {self.ports} and security group name are '
f'specified: {security_group_name}. It is not '
'guaranteed that the ports will be opened if the '
'specified security group is not correctly set up. '
'Please try to specify `ports` only and leave out '
'`aws.security_group_name` in `~/.sky/config.yaml` to '
'allow SkyPilot to automatically create and configure '
'the security group.')
logger.warning(
f'Ports {self.ports} and security group name are '
f'specified: {security_group_name}. It is not '
'guaranteed that the ports will be opened if the '
'specified security group is not correctly set up. '
'Please try to specify `ports` only and leave out '
'`aws.security_group_name` in `~/.sky/config.yaml` to '
'allow SkyPilot to automatically create and configure '
'the security group.')

Comment on lines +33 to +60
def _get_cloud_name_property_mapping(field: str):
return {
field: {
'oneOf': [
{
'type': 'string'
},
{
# A list of single-element dict to pretain the
# order.
# Example:
# property_name:
# - my-cluster1-*: my-property-1
# - my-cluster2-*: my-property-2
# - "*"": my-property-3
'type': 'array',
'items': {
'type': 'object',
'additionalProperties': {
'type': 'string'
},
'maxProperties': 1,
'minProperties': 1,
},
}
]
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can use a constant instead and let the callers use the constant.

_PRORPERTY_NAME_OR_CLUSTER_NAME_TO_PROPERTY = {
            'oneOf': [
                {
                    'type': 'string'
                },
                {
                    # A list of single-element dict to pretain the
                    # order.
                    # Example:
                    #  property_name:
                    #    - my-cluster1-*: my-property-1
                    #    - my-cluster2-*: my-property-2
                    #    - "*"": my-property-3
                    'type': 'array',
                    'items': {
                        'type': 'object',
                        'additionalProperties': {
                            'type': 'string'
                        },
                        'maxProperties': 1,
                        'minProperties': 1,
                    },
                }
            ]
        }

In the callers, we just use:

'remote_identity': _PRORPERTY_NAME_OR_CLUSTER_NAME_TO_PROPERTY
'security_group_name': _PRORPERTY_NAME_OR_CLUSTER_NAME_TO_PROPERTY

Comment on lines +926 to +936
if security_group_name is not None:
with ux_utils.print_exception_no_traceback():
logger.warning(
f'Ports {self.ports} and security group name are '
f'specified: {security_group_name}. It is not '
'guaranteed that the ports will be opened if the '
'specified security group is not correctly set up. '
'Please try to specify `ports` only and leave out '
'`aws.security_group_name` in `~/.sky/config.yaml` to '
'allow SkyPilot to automatically create and configure '
'the security group.')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch! Let's remove the warning here and move it to the aws.make_deploy_resources_variables so that the warning will only shown when launching an exact launchable resource. (check out the comment in aws.py above)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants