aws-ecs: Cluster.addAsgCapacityProvider blocks EC2 metadata in instances created by the ASG #28270

Closed
Brads3290 opened this issue Dec 6, 2023 · 7 comments · Fixed by #28437
Labels: @aws-cdk/aws-ecs (Related to Amazon Elastic Container), bug (This issue is a bug.), effort/medium (Medium work item – several days of effort), p2

Comments

@Brads3290

Brads3290 commented Dec 6, 2023

Describe the bug

Calling Cluster.addAsgCapacityProvider modifies the AutoScalingGroup, adding a user data script to the launch template that blocks container access to the EC2 instance metadata service at 169.254.169.254:

"MyEcsClusterAsgLaunchTemplate854B9AAD": {
   "Type": "AWS::EC2::LaunchTemplate",
   "Properties": {
    "LaunchTemplateData": {

     ...

     "UserData": {
      "Fn::Base64": {
       "Fn::Join": [
        "",
        [
         "#!/bin/bash\necho ECS_CLUSTER=",
         {
          "Ref": "MyEcsClusterACF79753"
         },
         " >> /etc/ecs/ecs.config\nsudo iptables --insert FORWARD 1 --in-interface docker+ --destination 169.254.169.254/32 --jump DROP\nsudo service iptables save\necho ECS_AWSVPC_BLOCK_IMDS=true >> /etc/ecs/ecs.config"
        ]
       ]
      }
     }
    },
    
    ...
}

In particular: sudo iptables --insert FORWARD 1 --in-interface docker+ --destination 169.254.169.254/32 --jump DROP

Without the call to Cluster.addAsgCapacityProvider, this rule isn't added (other methods may add it too).

This behaviour isn't documented anywhere that I can find. It causes the SDK to fail with the error No RegionEndpoint or ServiceURL configured when using the default settings and expecting everything to be picked up from the environment (which is how it seems to work when the cluster is created from the console).

Expected Behavior

Don't block access to EC2 metadata, or at least clearly document it and offer alternatives if someone wants to make use of the TaskRole inside the container

Current Behavior

The EC2 metadata endpoint is blocked using sudo iptables --insert FORWARD 1 --in-interface docker+ --destination 169.254.169.254/32 --jump DROP

Reproduction Steps

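using Amazon.CDK;
using Amazon.CDK.AWS.AutoScaling;
using Amazon.CDK.AWS.EC2;
using Amazon.CDK.AWS.ECS;

var app = new App();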
var stack = new Stack(app, "ReproStack", new StackProps() {
    Env = new Amazon.CDK.Environment {
        Account = System.Environment.GetEnvironmentVariable("CDK_DEFAULT_ACCOUNT"),
        Region = System.Environment.GetEnvironmentVariable("CDK_DEFAULT_REGION"),
    },
});

var defaultVpc = Vpc.FromLookup(stack, "defaultVpc", new VpcLookupOptions() {
    IsDefault = true,
});

var asg = new AutoScalingGroup(stack, "AutoScalingGroup", new AutoScalingGroupProps() {
    Vpc = defaultVpc,
    InstanceType = InstanceType.Of(InstanceClass.T3A, InstanceSize.SMALL),
    MachineImage = EcsOptimizedImage.AmazonLinux2(),
});

var cluster = new Cluster(stack, "Cluster", new ClusterProps() {
    Vpc = defaultVpc,
});

var provider = new AsgCapacityProvider(stack, "AsgCapacityProvider", new AsgCapacityProviderProps() {
    AutoScalingGroup = asg,
});

// With this line, the UserData is included in the ASG's launch template; without it, it is not
cluster.AddAsgCapacityProvider(provider);

app.Synth();

Possible Solution

Don't block access to EC2 metadata, or at least clearly document it and offer alternatives if someone wants to make use of the TaskRole inside the container

Additional Information/Context

No response

CDK CLI Version

2.114.0 (build 12879fa)

Framework Version

No response

Node.js Version

v18.16.1

OS

MacOS 13.6.2

Language

.NET

Language Version

.NET 8

Other information

No response

@Brads3290 Brads3290 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Dec 6, 2023
@github-actions github-actions bot added the @aws-cdk/aws-ecs Related to Amazon Elastic Container label Dec 6, 2023
@Brads3290
Author

Brads3290 commented Dec 6, 2023

I tested this a bit further. It looks like the task role still works with the SDK; it's just the region that can't be determined automatically.

I've updated my task definitions to set AWS_REGION, and everything works now.
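A rough model of why only the region lookup breaks: task-role credentials in ECS are served from the container credentials endpoint rather than from instance metadata, whereas region discovery can fall back to instance metadata when no environment variable is set. The sketch below is a simplification and not the SDK's actual code; `resolveRegion`, `Env`, and `ImdsRegionLookup` are hypothetical names, and the real lookup order differs between SDKs.

```typescript
// Environment variables as a plain map, so the sketch has no runtime dependencies.
type Env = Record<string, string | undefined>;

// Hypothetical stand-in for an instance metadata lookup.
// It returns undefined when the endpoint is unreachable (for example, blocked by iptables).
type ImdsRegionLookup = () => string | undefined;

// Simplified fallback chain: environment variables first, then instance metadata.
function resolveRegion(env: Env, imdsLookup: ImdsRegionLookup): string {
  const region = env["AWS_REGION"] ?? env["AWS_DEFAULT_REGION"] ?? imdsLookup();
  if (!region) {
    // Mirrors the message reported in this issue when nothing supplies a region.
    throw new Error("No RegionEndpoint or ServiceURL configured");
  }
  return region;
}
```

With the metadata lookup returning nothing, the call only succeeds when AWS_REGION (or AWS_DEFAULT_REGION) is present, which matches the workaround of setting it on the task definition.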

It only became a big headache for me because I had my containers set up in a cluster that I created using the console, then I decided to move it into my CDK stack, and all of a sudden my containers were failing with No RegionEndpoint or ServiceURL configured.

So I think ideally either stop blocking EC2 instance metadata (or add a boolean option to turn this on/off), or make it very clear in the docs that SDK usage in ECS containers won't work the same as in normal EC2 instances if the cluster is set up with CDK.
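The on/off option suggested here could be modelled as below. This is a minimal sketch of the user data assembly, not the library's code: `buildUserData` and `CapacityOptions` are hypothetical, and the flag name is borrowed from `canContainersAccessInstanceRole`, an option that aws-cdk-lib's capacity options appear to expose already (an assumption to verify against the API reference). The shell lines are the ones from the synthesized template in this issue.

```typescript
interface CapacityOptions {
  // Assumed flag: when true, containers keep access to instance metadata.
  canContainersAccessInstanceRole?: boolean;
}

// Builds the user data lines for an ECS container instance.
function buildUserData(clusterName: string, options: CapacityOptions = {}): string[] {
  const lines = [
    "#!/bin/bash",
    `echo ECS_CLUSTER=${clusterName} >> /etc/ecs/ecs.config`,
  ];
  if (!options.canContainersAccessInstanceRole) {
    // Default path: drop traffic from Docker bridge interfaces to the metadata endpoint.
    lines.push(
      "sudo iptables --insert FORWARD 1 --in-interface docker+ --destination 169.254.169.254/32 --jump DROP",
      "sudo service iptables save",
      "echo ECS_AWSVPC_BLOCK_IMDS=true >> /etc/ecs/ecs.config",
    );
  }
  return lines;
}
```

Calling `buildUserData("my-cluster", { canContainersAccessInstanceRole: true })` returns only the shebang and the ECS_CLUSTER line, so the iptables rule is never installed.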

@pahud
Contributor

pahud commented Dec 6, 2023

https://github.com/aws/aws-cdk/blame/2d9de189e583186f2b77386ae4fcfff42c864568/packages/aws-cdk-lib/aws-ecs/lib/cluster.ts#L502-L504

I think your taskRole should work as expected if you provision your ECS cluster/service/task with CDK. I can't find any documentation about blocking metadata access, but we need to be very careful about removing it: this has been a stable module for a long time, and removing the block might cause additional issues or breaking changes.

Your Reproduction Steps look great to me and the task role should work as expected. Can you share more about your task definition and service running on this ASG and the error message you've seen with that?

@pahud pahud added p2 response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. effort/medium Medium work item – several days of effort and removed needs-triage This issue or PR still needs to be triaged. labels Dec 6, 2023
@Brads3290
Author

Brads3290 commented Dec 6, 2023

Can you share more about your task definition and service running on this ASG and the error message you've seen with that?

@pahud Sure. It's an ASP.NET Core app on .NET 8. At the point where it was failing, it was trying to fetch SSM parameters for the app configuration using the Amazon.Extensions.Configuration.SystemsManager package, which I believe internally uses the AWS SDK (AWSSDK.SimpleSystemsManagement) to pull those parameters.

The task definition is pretty basic, just setting the network mode (I think bridge is the default anyway?), task role permissions for SSM and DynamoDB, and adding the container.

// Task Definition
_task = new(scope, $"{id}-TaskDefinition", new TaskDefinitionProps() {
    NetworkMode = NetworkMode.BRIDGE,
});

_task.TaskRole.AttachInlinePolicy(new(scope, $"{id}-Policy", new PolicyProps() {
    Statements = new PolicyStatement[] {
        new(new PolicyStatementProps() {
            Actions = new[] {
                "ssm:GetParameter",
                "ssm:GetParameters",
                "ssm:GetParametersByPath",
            },
            Effect = Effect.ALLOW,
            Resources = new[] {
                $"arn:aws:ssm:ap-southeast-2:{Account}:parameter/{InfrastructureConstants.SsmParameterPrefixGlobal}",
                $"arn:aws:ssm:ap-southeast-2:{Account}:parameter/{InfrastructureConstants.SsmParameterPrefixAuthService}",
            },
        }),
        new(new PolicyStatementProps() {
            Actions = new[] {
                "dynamodb:BatchWriteItem",
                "dynamodb:PutItem",
                "dynamodb:DescribeTable",
                "dynamodb:DeleteItem",
                "dynamodb:GetItem",
                "dynamodb:Query",
                "dynamodb:UpdateItem",
            },
            Effect = Effect.ALLOW,
            Resources = new[] {
                authSessionsTable.TableArn,
            },
        }),
    },
}));


// Container
var environment = new Dictionary<string, string>() {
    { "ASPNETCORE_ENVIRONMENT", "Production" },
    { "ASPNETCORE_URLS", "http://+:80" },
};

// -- This was only added after discovering the issue; you will need to remove this block to reproduce the problem.
if (!String.IsNullOrEmpty(props.Region)) {
    environment["AWS_REGION"] = props.Region;
    environment["AWS_DEFAULT_REGION"] = props.Region;
}

var container = _task.AddContainer($"{id}-Container", new ContainerDefinitionOptions() {
    Image = ContainerImage.FromEcrRepository(Repository),
    MemoryReservationMiB = 128,
    PortMappings = new IPortMapping[] {
        new PortMapping() { ContainerPort = 80, HostPort = 0 },
    },
    Environment = environment,
    Logging = new AwsLogDriver(new AwsLogDriverProps() {
        StreamPrefix = id,
    }),
});

// Service
var service = new Ec2Service(scope, $"{id}-Service", new Ec2ServiceProps() {
    Cluster = props.EcsCluster,
    TaskDefinition = _task,
    DesiredCount = 0, // I adjust this in the console once I push the code to the ECR repository.
});

The line that's failing in Program.cs:

var builder = WebApplication.CreateBuilder(args);

// Exception here. Note -- optional must be `false`. If true, it will fail silently and just not apply the configuration without throwing.
builder.Configuration
    .AddSystemsManager($"/{InfrastructureConstants.SsmParameterPrefixGlobal}", optional: false, reloadAfter: TimeSpan.FromMinutes(1))
    .AddSystemsManager($"/{InfrastructureConstants.SsmParameterPrefixAuthService}", optional: false, reloadAfter: TimeSpan.FromMinutes(1));

The error message in the logs on service startup:

Unhandled exception. System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
 ---> Amazon.Runtime.AmazonClientException: No RegionEndpoint or ServiceURL configured
   at Amazon.Runtime.ClientConfig.Validate()
   at Amazon.Runtime.AmazonServiceClient..ctor(AWSCredentials credentials, ClientConfig config)
   at Amazon.SimpleSystemsManagement.AmazonSimpleSystemsManagementClient..ctor(AWSCredentials credentials, AmazonSimpleSystemsManagementConfig clientConfig)
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeDirectByRefWithFewArgs(Object obj, Span`1 copyOfArgs, BindingFlags invokeAttr)
   --- End of inner exception stack trace ---
   at System.Reflection.MethodBaseInvoker.InvokeDirectByRefWithFewArgs(Object obj, Span`1 copyOfArgs, BindingFlags invokeAttr)
   at System.Reflection.MethodBaseInvoker.InvokeWithFewArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at Amazon.Extensions.NETCore.Setup.ClientFactory.CreateClient(Type serviceInterfaceType, AWSCredentials credentials, ClientConfig config)
   at Amazon.Extensions.NETCore.Setup.ClientFactory.CreateServiceClient(ILogger logger, Type serviceInterfaceType, AWSOptions options)
   at Amazon.Extensions.NETCore.Setup.AWSOptions.CreateServiceClient[T]()
   at Amazon.Extensions.Configuration.SystemsManager.Internal.SystemsManagerProcessor.GetParametersByPathAsync()
   at Amazon.Extensions.Configuration.SystemsManager.Internal.SystemsManagerProcessor.GetDataAsync()
   at Amazon.Extensions.Configuration.SystemsManager.SystemsManagerConfigurationProvider.LoadAsync(Boolean reload)
   at Amazon.Extensions.Configuration.SystemsManager.SystemsManagerConfigurationProvider.Load()
   at Microsoft.Extensions.Configuration.ConfigurationManager.AddSource(IConfigurationSource source)
   at Microsoft.Extensions.Configuration.ConfigurationManager.Microsoft.Extensions.Configuration.IConfigurationBuilder.Add(IConfigurationSource source)
   at Microsoft.Extensions.Configuration.SystemsManagerExtensions.AddSystemsManager(IConfigurationBuilder builder, Action`1 configureSource)
   at Microsoft.Extensions.Configuration.SystemsManagerExtensions.AddSystemsManager(IConfigurationBuilder builder, String path, TimeSpan reloadAfter)
   at Program.<Main>$(String[] args) in /src/RegionalCoordinator/RegionalCoordinator.Web/Program.cs:line 19

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Dec 6, 2023
@juinquok
Contributor

I had a similar issue recently: access to the instance metadata endpoint was failing, which caused aws-otel and aws-sdk to throw errors on container startup. I was also able to resolve it by simply adding AWS_REGION to the container's environment, as the only thing the underlying dependencies could not source was the region.

The reason behind the iptables block is documented at: https://docs.aws.amazon.com/AmazonECS/latest/bestpracticesguide/security-iam-roles.html

May I suggest that we add the AWS_REGION env variable as part of the Cluster.addAsgCapacityProvider method? @pahud I can raise a PR for this.

@pahud
Contributor

pahud commented Dec 18, 2023

@juinquok

Sounds good! Thank you!

@Brads3290
Author

@pahud @juinquok

I also came across that page when I was debugging my issue and thought "well I haven't done that, so that can't be the problem". I think it needs to be documented on the CDK side that addAsgCapacityProvider does this (along with any other methods / properties that cause this behaviour).

@mergify mergify bot closed this as completed in #28437 Jan 18, 2024
mergify bot pushed a commit that referenced this issue Jan 18, 2024
…ider for autoscaling (#28437)

### Why is this needed?
When an Auto Scaling group is added as a capacity provider using `Cluster.addAsgCapacityProvider` and the task definition being run uses the AWS_VPC network mode, the metadata service at `169.254.169.254` is blocked. This is a security best practice, as detailed [here](https://docs.aws.amazon.com/AmazonECS/latest/bestpracticesguide/security-iam-roles.html). It is implemented [here](https://github.com/aws/aws-cdk/blame/2d9de189e583186f2b77386ae4fcfff42c864568/packages/aws-cdk-lib/aws-ecs/lib/cluster.ts#L502-L504). However, as a result, some applications, such as those raised in #28270 as well as the aws-otel package, cannot determine the AWS region, which causes the application to crash and exit.

### What does it implement?
This PR adds an override of the addContainer method on Ec2TaskDefinition that adds the AWS_REGION environment variable to the container if the network mode is AWS_VPC. The region is sourced at synth time from the stack that contains the construct. The environment variable is only required with the EC2 capacity provider and not on Fargate, because the failure to determine the region on startup only occurs when using the EC2 capacity provider with the AWS_VPC networking mode.

The original issue points at the `addAsgCapacityProvider` action, which targets the cluster. However, the task definition cannot be mutated at that point, so this change addresses it when the task definition is actually added to a service that meets all the requirements under which the failure to determine the region occurs.
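The environment handling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in #28437: `withRegionForAwsVpc`, `ContainerOptions`, and `NetworkMode` are hypothetical names, and letting a user-defined AWS_REGION take precedence is a design choice of this sketch that the PR description does not confirm.

```typescript
type NetworkMode = "awsvpc" | "bridge" | "host" | "none";

interface ContainerOptions {
  image: string;
  environment?: Record<string, string>;
}

// Returns a copy of the options with AWS_REGION added when the task uses awsvpc.
// Other network modes are returned unchanged.
function withRegionForAwsVpc(
  options: ContainerOptions,
  networkMode: NetworkMode,
  stackRegion: string,
): ContainerOptions {
  if (networkMode !== "awsvpc") {
    return options;
  }
  return {
    ...options,
    // Assumption: user-defined variables are spread last, so they override the injected default.
    environment: { AWS_REGION: stackRegion, ...options.environment },
  };
}
```

Keeping the function pure (it returns a new object instead of mutating its input) makes the injected variable easy to assert on in tests alongside user-defined variables.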

Updated the relevant integration tests to reflect the new environment variable being created alongside user-defined environment variables.

Closes #28270

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*