aws_stepfunctions_tasks: cannot add capacity provider when using EcsEc2LaunchTarget #30171

Open
sandra-selfdecode opened this issue May 13, 2024 · 5 comments


Describe the bug

The example code provided in https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_stepfunctions_tasks/EcsEc2LaunchTarget.html is not functional. Adding a default ASG capacity provider works in the console, but doing the same in CDK returns an error unless the capacity provider is specified in the EcsRunTask parameters. I currently cannot find any way to add the capacity provider to those parameters.

Expected Behavior

My state machine should successfully execute an aws_stepfunctions_tasks.EcsRunTask with the EcsEc2LaunchTarget when I have defined a default capacity provider strategy for my cluster.

Current Behavior

The task receives an error from the cluster and cannot be started.

"cause": "No Container Instances were found in your cluster. (Service: AmazonECS; Status Code: 400; Error Code: InvalidParameterException; Request ID: cb76661d-7bb1-49ee-8fa7-8ba1feea5656; Proxy: null)",
"error": "ECS.InvalidParameterException",
"resource": "runTask.waitForTaskToken",
"resourceType": "ecs"

Reproduction Steps

import aws_cdk as cdk
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_stepfunctions as sfn
from aws_cdk import aws_stepfunctions_tasks as sfn_tasks
from aws_cdk.aws_ecs import PropagatedTagSource

# default_vpc, autoscaling_group, task_definition, task, command, env, result_path
# and HEARTBEAT_TIMEOUT are defined elsewhere in the stack (omitted from this snippet).
cluster = ecs.Cluster(self, "Cluster", vpc=default_vpc.vpc)

# Capacity provider backed by the existing Auto Scaling group.
capacity_provider = ecs.AsgCapacityProvider(
    self,
    "CapacityProvider",
    auto_scaling_group=autoscaling_group,
    spot_instance_draining=True,
)
cluster.add_asg_capacity_provider(capacity_provider, spot_instance_draining=True)

# Make the ASG capacity provider the cluster default.
cluster.add_default_capacity_provider_strategy(
    [
        ecs.CapacityProviderStrategy(
            capacity_provider=capacity_provider.capacity_provider_name,
            base=10,
            weight=100,
        ),
    ]
)

# The EC2 launch target offers no way to pass a capacity provider strategy.
run_task = sfn_tasks.EcsRunTask(
    self,
    task.title().replace("_", ""),
    cluster=cluster,
    launch_target=sfn_tasks.EcsEc2LaunchTarget(),
    task_definition=task_definition,
    container_overrides=[
        sfn_tasks.ContainerOverride(
            container_definition=task_definition.default_container,
            command=sfn.JsonPath.list_at(f"$.{command}"),
            environment=env,
        )
    ],
    propagated_tag_source=PropagatedTagSource.TASK_DEFINITION,
    result_path=result_path,
    integration_pattern=sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN,
    heartbeat_timeout=sfn.Timeout.duration(cdk.Duration.seconds(HEARTBEAT_TIMEOUT)),
    task_timeout=sfn.Timeout.duration(cdk.Duration.minutes(120)),
).add_retry(
    errors=[
        "States.HeartbeatTimeout",
        "States.Timeout",
        "Ecs.ClientException",
        "Ecs.SdkClientException",
        "Ecs.ServerException",
    ],
    interval=cdk.Duration.seconds(30),
    jitter_strategy=sfn.JitterType.FULL,
)

Possible Solution

It would be helpful if it could be passed to the launch target in the same manner as placement constraints and placement strategies.
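
A minimal sketch of what such an option would need to render, assuming the Step Functions ecs:runTask parameter names Cluster, TaskDefinition, LaunchType and CapacityProviderStrategy. The helper build_run_task_parameters is hypothetical and not part of CDK; it only illustrates the rule from the ECS RunTask API documentation that launchType must be omitted when a capacityProviderStrategy is specified.

```python
from typing import Any, Dict, List, Optional


def build_run_task_parameters(
    cluster: str,
    task_definition: str,
    capacity_provider_strategy: Optional[List[Dict[str, Any]]] = None,
    launch_type: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble ecs:runTask parameters for a task state.

    ECS rejects a request that carries both a launch type and a capacity
    provider strategy, so the two arguments are treated as mutually exclusive.
    """
    if capacity_provider_strategy and launch_type:
        raise ValueError(
            "Specify either capacity_provider_strategy or launch_type, not both"
        )

    params: Dict[str, Any] = {
        "Cluster": cluster,
        "TaskDefinition": task_definition,
    }
    if capacity_provider_strategy:
        params["CapacityProviderStrategy"] = capacity_provider_strategy
    elif launch_type:
        params["LaunchType"] = launch_type
    # With neither set, ECS falls back to the cluster's default strategy.
    return params


# Example with placeholder identifiers (not taken from the issue).
example = build_run_task_parameters(
    cluster="my-cluster",
    task_definition="my-task-def",
    capacity_provider_strategy=[{"CapacityProvider": "my-asg-provider", "Weight": 100}],
)
```

In this shape, a launch target that accepted a capacity provider strategy would have to drop LaunchType from the rendered state rather than add the strategy alongside it.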

Additional Information/Context

No response

CDK CLI Version

[email protected]

Framework Version

No response

Node.js Version

18

OS

ubuntu-latest

Language

Python

Language Version

3.11

Other information

No response

sandra-selfdecode added the bug and needs-triage labels May 13, 2024
pahud self-assigned this May 14, 2024
pahud added the investigating label May 14, 2024
@pahud
Contributor

pahud commented May 14, 2024

This error generally indicates that your ECS cluster has no container instances registered.

"cause": "No Container Instances were found in your cluster. (Service: AmazonECS; Status Code: 400; Error Code: InvalidParameterException; Request ID: cb76661d-7bb1-49ee-8fa7-8ba1feea5656; Proxy: null)",

The container instances are essentially the EC2 instances from your AsgCapacityProvider, which uses the ASG you provide as its Auto Scaling group. I can't see from your snippet how you create your ASG:

capacity_provider = ecs.AsgCapacityProvider(
    self,
    "CapacityProvider",
    auto_scaling_group=autoscaling_group,
    spot_instance_draining=True,
)

If you use AsgCapacityProvider, it is your responsibility to supply your own ASG with an ECS-compatible machine image, as described in the documentation, and to make sure at least one container instance is registered into the cluster. In most cases, you should just use cluster.addCapacity() (add_capacity() in Python) and allow ECS to manage the ASG for you.

Can you check your ECS cluster and see whether at least one container instance has been registered? You can list the container instances with the AWS CLI as follows:

%  aws ecs list-container-instances --cluster dummy-stack2-EcsCluster97242B84-8M6Tf96WgYJ3
{
    "containerInstanceArns": [
        "arn:aws:ecs:us-east-1:123456789012:container-instance/dummy-stack2-EcsCluster97242B84-8M6Tf96WgYJ3/0f244b476146426fb20e253b747900fd",
        "arn:aws:ecs:us-east-1:123456789012:container-instance/dummy-stack2-EcsCluster97242B84-8M6Tf96WgYJ3/ffbf3f0e25624c45ba60a5da2f19fa96"
    ]
}
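
The same check can be sketched in Python. This assumes a boto3 ECS client is passed in; list_container_instances and its containerInstanceArns and nextToken response keys are the boto3 names, while the helper list_container_instance_arns is hypothetical.

```python
from typing import Any, Dict, List


def list_container_instance_arns(ecs_client: Any, cluster: str) -> List[str]:
    """Return all container instance ARNs registered to the cluster.

    ecs_client is expected to behave like boto3.client("ecs"); results are
    paged, so the loop follows nextToken until it is absent.
    """
    arns: List[str] = []
    kwargs: Dict[str, Any] = {"cluster": cluster}
    while True:
        page = ecs_client.list_container_instances(**kwargs)
        arns.extend(page.get("containerInstanceArns", []))
        token = page.get("nextToken")
        if not token:
            return arns
        kwargs["nextToken"] = token


# Usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# arns = list_container_instance_arns(boto3.client("ecs"), "my-cluster")
# print(len(arns), "container instance(s) registered")
```

An empty list here would be consistent with the "No Container Instances were found" error, including the case where the ASG is scaled to zero.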

pahud added the response-requested label and removed the needs-triage and investigating labels May 14, 2024
pahud removed their assignment May 14, 2024
pahud added the p2 label May 14, 2024
@sandra-selfdecode
Author

The cluster has an Auto Scaling group capacity provider. I left some of my code out of what I shared because it's a very complicated Auto Scaling group.
If the ecs:RunTask parameters contain the capacity provider strategy, the task is successfully sent to the cluster. This has been thoroughly and exhaustively tested for years. However, I cannot find any way to add the capacity provider strategy to sfn_tasks.EcsRunTask, so I have to use sfn_tasks.CallAwsService instead. Using CallAwsService means I lose some of the conveniences found in the console, such as a direct link to the ECS task. When I have 50 of the same task running at the same time and one has an error, it's really nice to have the direct link that sfn_tasks.EcsRunTask provides.
Here is what works:

run_task = sfn_tasks.CallAwsService(
    self,
    task.title().replace("_", ""),
    service="ecs",
    action="runTask",
    parameters={
        "Cluster": cluster_name,
        "CapacityProviderStrategy": [
            {"CapacityProvider": capacity_provider.capacity_provider_name}
        ],
        "TaskDefinition": task_definition.task_definition_arn,
        "Overrides": {"ContainerOverrides": [container_override]},
        "PropagateTags": "TASK_DEFINITION",
    },
    result_path=f"$.{task}",
    integration_pattern=sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN,
    heartbeat_timeout=sfn.Timeout.duration(
        cdk.Duration.seconds(settings.HEARTBEAT_TIMEOUT)
    ),
    task_timeout=sfn.Timeout.duration(cdk.Duration.minutes(120)),
    iam_resources=[task_arn],
).add_retry(
    errors=[
        "States.HeartbeatTimeout",
        "States.Timeout",
        "Ecs.ClientException",
        "Ecs.SdkClientException",
        "Ecs.ServerException",
    ],
    interval=cdk.Duration.seconds(30),
    jitter_strategy=sfn.JitterType.FULL,
)

The weird thing is that when I create a cluster in the console and give it a default capacity provider strategy, as I'm doing in CDK, I don't need to specify the capacity provider strategy in the run task parameters. With the CDK setup, for some reason I do, and that is why I'm asking you to let us add it.
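
One possible explanation, sketched under assumptions: the ECS RunTask API documentation states that the cluster's default capacity provider strategy is used only when the request carries neither a launchType nor a capacityProviderStrategy. If EcsEc2LaunchTarget renders LaunchType EC2 into the request (an assumption, not confirmed in this thread), the default strategy would be bypassed. The function resolve_placement below is a hypothetical model of that rule, not an AWS or CDK API.

```python
from typing import Any, Dict, List, Optional, Tuple

Strategy = List[Dict[str, Any]]


def resolve_placement(
    request: Dict[str, Any],
    cluster_default_strategy: Optional[Strategy],
) -> Tuple[str, Any]:
    """Model which placement input a RunTask request ends up using.

    Order follows the documented rule: an explicit strategy or launch type in
    the request wins; the cluster default applies only when both are absent.
    """
    if "CapacityProviderStrategy" in request:
        return ("capacity_provider_strategy", request["CapacityProviderStrategy"])
    if "LaunchType" in request:
        # The cluster default strategy is not consulted in this case.
        return ("launch_type", request["LaunchType"])
    if cluster_default_strategy:
        return ("default_capacity_provider_strategy", cluster_default_strategy)
    # Behavior without any of the three is outside the scope of this sketch.
    return ("unspecified", None)


# Placeholder default strategy mirroring the base/weight values in the issue.
default = [{"CapacityProvider": "my-asg-provider", "Base": 10, "Weight": 100}]

with_launch_type = resolve_placement({"LaunchType": "EC2"}, default)
without_launch_type = resolve_placement({}, default)
```

Under this model, a request with LaunchType set is placed directly on registered container instances, which fails when the ASG is at zero, while a request that omits it would go through the default capacity provider.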

github-actions bot removed the response-requested label May 14, 2024
@pahud
Contributor

pahud commented May 31, 2024

Are you able to see any instances running in your cluster, like this? As I mentioned above, the error message means there is no valid instance in your cluster. My guess is that your ASG is not launching ECS instances that register to the cluster.

%  aws ecs list-container-instances --cluster dummy-stack2-EcsCluster97242B84-8M6Tf96WgYJ3
{
    "containerInstanceArns": [
        "arn:aws:ecs:us-east-1:123456789012:container-instance/dummy-stack2-EcsCluster97242B84-8M6Tf96WgYJ3/0f244b476146426fb20e253b747900fd",
        "arn:aws:ecs:us-east-1:123456789012:container-instance/dummy-stack2-EcsCluster97242B84-8M6Tf96WgYJ3/ffbf3f0e25624c45ba60a5da2f19fa96"
    ]
}

pahud added the response-requested label May 31, 2024
@sandra-selfdecode
Author

sandra-selfdecode commented May 31, 2024

Yes, it does.
To clarify, the error occurs when the Auto Scaling group's capacity is at 0. The task is supposed to be added to the cluster and stay in provisioning status until the ASG has scaled up.

@sandra-selfdecode
Author

But apart from my problem, it would be really helpful for people to specify capacity provider strategies, which is why boto3 offers that option.

github-actions bot removed the response-requested label May 31, 2024