aws_stepfunctions_tasks: cannot add capacity provider when using EcsEc2LaunchTarget #30171

sandra-selfdecode · 2024-05-13T00:54:25Z

Describe the bug

The example code provided in https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_stepfunctions_tasks/EcsEc2LaunchTarget.html is not functional. Although adding a default ASG capacity provider works in the console, doing the same in CDK returns an error if the capacity provider is not specified in the EcsRunTask parameters. I currently cannot find any way to add the capacity provider to those parameters.

Expected Behavior

My state machine should successfully execute an aws_stepfunctions_tasks.EcsRunTask with the EcsEc2LaunchTarget when I have defined a default capacity provider strategy for my cluster.

Current Behavior

The task receives an error from the cluster and cannot be started.

"cause": "No Container Instances were found in your cluster. (Service: AmazonECS; Status Code: 400; Error Code: InvalidParameterException; Request ID: cb76661d-7bb1-49ee-8fa7-8ba1feea5656; Proxy: null)",
"error": "ECS.InvalidParameterException",
"resource": "runTask.waitForTaskToken",
"resourceType": "ecs"

Reproduction Steps

import aws_cdk as cdk
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_stepfunctions as sfn
from aws_cdk import aws_stepfunctions_tasks as sfn_tasks
from aws_cdk.aws_ecs import PropagatedTagSource

# Runs inside a Stack/Construct. default_vpc, autoscaling_group, task_definition,
# task, command, env, result_path and HEARTBEAT_TIMEOUT are defined elsewhere (omitted).
cluster = ecs.Cluster(self, "Cluster", vpc=default_vpc.vpc)
capacity_provider = ecs.AsgCapacityProvider(
    self,
    "CapacityProvider",
    auto_scaling_group=autoscaling_group,
    spot_instance_draining=True,
)
cluster.add_asg_capacity_provider(capacity_provider, spot_instance_draining=True)

# Make the ASG capacity provider the cluster default.
cluster.add_default_capacity_provider_strategy(
    [
        ecs.CapacityProviderStrategy(
            capacity_provider=capacity_provider.capacity_provider_name,
            base=10,
            weight=100,
        ),
    ]
)

# No capacity provider strategy can be passed here, which is the problem.
run_task = sfn_tasks.EcsRunTask(
    self,
    task.title().replace("_", ""),
    cluster=cluster,
    launch_target=sfn_tasks.EcsEc2LaunchTarget(),
    task_definition=task_definition,
    container_overrides=[
        sfn_tasks.ContainerOverride(
            container_definition=task_definition.default_container,
            command=sfn.JsonPath.list_at(f"$.{command}"),
            environment=env,
        )
    ],
    propagated_tag_source=PropagatedTagSource.TASK_DEFINITION,
    result_path=result_path,
    integration_pattern=sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN,
    heartbeat_timeout=sfn.Timeout.duration(cdk.Duration.seconds(HEARTBEAT_TIMEOUT)),
    task_timeout=sfn.Timeout.duration(cdk.Duration.minutes(120)),
).add_retry(
    errors=[
        "States.HeartbeatTimeout",
        "States.Timeout",
        "Ecs.ClientException",
        "Ecs.SdkClientException",
        "Ecs.ServerException",
    ],
    interval=cdk.Duration.seconds(30),
    jitter_strategy=sfn.JitterType.FULL,
)

Possible Solution

It would be helpful if it could be passed to the launch target in the same manner as placement constraints and placement strategies.

Additional Information/Context

No response

CDK CLI Version

cdk@2.110.1

Framework Version

No response

Node.js Version

18

OS

ubuntu-latest

Language

Python

Language Version

3.11

Other information

No response

pahud · 2024-05-14T14:53:02Z

This error generally indicates that your ECS cluster does not have any container instances registered.

"cause": "No Container Instances were found in your cluster. (Service: AmazonECS; Status Code: 400; Error Code: InvalidParameterException; Request ID: cb76661d-7bb1-49ee-8fa7-8ba1feea5656; Proxy: null)",

The container instances are essentially the EC2 instances from your AsgCapacityProvider, which uses the ASG you provide as its Auto Scaling group. I can't see from your snippet how you create your ASG:

capacity_provider = ecs.AsgCapacityProvider(
    self,
    "CapacityProvider",
    auto_scaling_group=autoscaling_group,
    spot_instance_draining=True,
)

When you use AsgCapacityProvider, it is your responsibility to supply your own ASG with an ECS-compatible machine image, as described in the doc, and to make sure at least one container instance is registered into the cluster. In most cases, you should just use cluster.addCapacity() and allow ECS to manage the ASG for you.

Can you check your ECS cluster and see whether at least one container instance has been registered? You can list the container instances using the AWS CLI as below:

%  aws ecs list-container-instances --cluster dummy-stack2-EcsCluster97242B84-8M6Tf96WgYJ3
{
    "containerInstanceArns": [
        "arn:aws:ecs:us-east-1:123456789012:container-instance/dummy-stack2-EcsCluster97242B84-8M6Tf96WgYJ3/0f244b476146426fb20e253b747900fd",
        "arn:aws:ecs:us-east-1:123456789012:container-instance/dummy-stack2-EcsCluster97242B84-8M6Tf96WgYJ3/ffbf3f0e25624c45ba60a5da2f19fa96"
    ]
}
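
If this check needs to be scripted, the JSON printed by the command above can be parsed directly. The following is only a minimal sketch: the helper names are made up for illustration, and the sample ARN uses a placeholder cluster name rather than a real one.

```python
import json
from typing import List


def container_instance_arns(cli_output: str) -> List[str]:
    """Return the ARNs found in `aws ecs list-container-instances` JSON output."""
    payload = json.loads(cli_output)
    # The key is absent or empty when nothing is registered in the cluster.
    return list(payload.get("containerInstanceArns", []))


def has_registered_instances(cli_output: str) -> bool:
    """True when at least one container instance is registered."""
    return len(container_instance_arns(cli_output)) > 0


# Placeholder output, shaped like the CLI response shown above.
sample_output = json.dumps(
    {
        "containerInstanceArns": [
            "arn:aws:ecs:us-east-1:123456789012:container-instance/"
            "example-cluster/0f244b476146426fb20e253b747900fd"
        ]
    }
)
empty_output = '{"containerInstanceArns": []}'

print(has_registered_instances(sample_output))
print(has_registered_instances(empty_output))
```

An empty list here would be consistent with the "No Container Instances were found" error, although an ASG that scales from zero through a capacity provider can legitimately have no instances registered until a task is placed.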

sandra-selfdecode · 2024-05-14T18:44:54Z

The cluster has an Auto Scaling group capacity provider. I omitted some of my code from what I shared because the Auto Scaling group is very complicated.
If the ecs:RunTask parameters contain the capacity provider strategy, the task is successfully sent to the cluster. This has been thoroughly and exhaustively tested for years. However, I cannot find any way to add the capacity provider strategy to sfn_tasks.EcsRunTask, so I have to use sfn_tasks.CallAwsService instead. Using CallAwsService means I lose some of the conveniences found in the console, such as a direct link to the ECS task. When I have 50 of the same task running at the same time and one has an error, it's really nice to have the direct link that sfn_tasks.EcsRunTask provides.
Here is what works:

run_task = sfn_tasks.CallAwsService(
    self,
    task.title().replace("_", ""),
    service="ecs",
    action="runTask",
    parameters={
        "Cluster": cluster_name,
        "CapacityProviderStrategy": [{"CapacityProvider": capacity_provider.capacity_provider_name}],
        "TaskDefinition": task_definition.task_definition_arn,
        "Overrides": {"ContainerOverrides": [container_override]},
        "PropagateTags": "TASK_DEFINITION",
    },
    result_path=f"$.{task}",
    integration_pattern=sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN,
    heartbeat_timeout=sfn.Timeout.duration(
        cdk.Duration.seconds(settings.HEARTBEAT_TIMEOUT)
    ),
    task_timeout=sfn.Timeout.duration(cdk.Duration.minutes(120)),
    iam_resources=[task_arn],
).add_retry(
    errors=[
        "States.HeartbeatTimeout",
        "States.Timeout",
        "Ecs.ClientException",
        "Ecs.SdkClientException",
        "Ecs.ServerException",
    ],
    interval=cdk.Duration.seconds(30),
    jitter_strategy=sfn.JitterType.FULL,
)

The weird thing is that when I create a cluster in the console and give it a default capacity provider strategy, as I'm doing in CDK, I don't need to specify the capacity provider strategy in the run task parameters. With the CDK-defined task I do, and that is why I'm asking you to let us add it.
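
One possible explanation for the difference, offered as a sketch rather than a confirmed diagnosis: the ECS RunTask API uses the cluster's default capacity provider strategy only when the request sets neither a launch type nor a capacity provider strategy, and it does not accept both together. If EcsEc2LaunchTarget adds "LaunchType": "EC2" to the request (an assumption here, not verified against the CDK source), the cluster default would be bypassed. The function names and the cluster, task, and provider names below are hypothetical and only model the request shape.

```python
from typing import Any, Dict, List, Optional


def build_run_task_parameters(
    cluster: str,
    task_definition: str,
    launch_type: Optional[str] = None,
    capacity_provider_strategy: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Build ecs:RunTask parameters in the PascalCase form used by Step Functions."""
    if launch_type and capacity_provider_strategy:
        # RunTask requires launchType to be omitted when a strategy is given.
        raise ValueError("LaunchType and CapacityProviderStrategy are mutually exclusive")
    params: Dict[str, Any] = {"Cluster": cluster, "TaskDefinition": task_definition}
    if launch_type:
        params["LaunchType"] = launch_type
    if capacity_provider_strategy:
        params["CapacityProviderStrategy"] = capacity_provider_strategy
    return params


def uses_cluster_default_strategy(params: Dict[str, Any]) -> bool:
    """The cluster default applies only when neither field is present."""
    return "LaunchType" not in params and "CapacityProviderStrategy" not in params


# Assumed shape of the request produced with EcsEc2LaunchTarget.
ec2_target = build_run_task_parameters("my-cluster", "my-task:1", launch_type="EC2")
# Shape of the CallAwsService workaround shown above.
explicit = build_run_task_parameters(
    "my-cluster",
    "my-task:1",
    capacity_provider_strategy=[{"CapacityProvider": "my-asg-provider"}],
)
# A request that sets neither field, so the cluster default applies.
neither = build_run_task_parameters("my-cluster", "my-task:1")

print(uses_cluster_default_strategy(ec2_target))  # False
print(uses_cluster_default_strategy(explicit))    # False
print(uses_cluster_default_strategy(neither))     # True
```

Under that assumption, letting the launch target emit a CapacityProviderStrategy instead of a LaunchType, as requested in this issue, would let the managed ASG scale out for the task.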

sandra-selfdecode added the bug and needs-triage labels May 13, 2024

github-actions bot added the @aws-cdk/aws-stepfunctions-tasks label May 13, 2024

pahud self-assigned this May 14, 2024

pahud added the investigating label May 14, 2024

pahud added the response-requested label and removed the needs-triage and investigating labels May 14, 2024

pahud removed their assignment May 14, 2024

pahud added the p2 label May 14, 2024

github-actions bot removed the response-requested label May 14, 2024
