(aws-ec2): userDataCausesReplacement timeouts, if resourceSignalTimeout is set #12749

ilko-rbi · 2021-01-28T14:36:11Z

If userDataCausesReplacement is set to true and resourceSignalTimeout is also set, the CF stack times out while creating the new EC2 instance, because it never receives the cfn-signal from the newly created instance.

Reproduction Steps

The following stack works:

class Ec2UserDataReplacementTestStack(core.Stack):
    def __init__(self, scope: core.Construct, sid: str, **kwargs) -> None:
        super().__init__(scope, sid, **kwargs)

        vpc = aws_ec2.Vpc.from_lookup(self, id="vpcimported", vpc_id=VPC_ID)

        user_data_script_path = path.join(Path(__file__).parent.absolute(), "user-data.sh")

        ec2 = aws_ec2.Instance(
            scope=self,
            id="ec2id",
            instance_type=aws_ec2.InstanceType(instance_type_identifier="t3a.large"),
            machine_image=aws_ec2.GenericLinuxImage({"eu-central-1": "ami-0fc812ebb87bb5b8e"}),
            vpc=vpc,
            user_data_causes_replacement=True,
        )
        user_data_asset = aws_s3_assets.Asset(self, "user-data-s3-asset", path=user_data_script_path)
        local_path = ec2.user_data.add_s3_download_command(
            bucket=user_data_asset.bucket, bucket_key=user_data_asset.s3_object_key
        )
        ec2.user_data.add_execute_file_command(file_path=local_path)
        ec2.user_data.add_signal_on_exit_command(ec2)
        user_data_asset.grant_read(ec2.role)

If we additionally give the ec2.Instance a custom timeout for the cfn-signal, a CreationPolicy is generated in the stack, and the creation / update of the stack times out after the configured duration:

ec2 = aws_ec2.Instance(
    scope=self,
    id="ec2id",
    instance_type=aws_ec2.InstanceType(instance_type_identifier="t3a.large"),
    machine_image=aws_ec2.GenericLinuxImage({"eu-central-1": "ami-0fc812ebb87bb5b8e"}),
    vpc=vpc,
    user_data_causes_replacement=True,
    resource_signal_timeout=core.Duration.minutes(3),
)

adds:

CreationPolicy:
      ResourceSignal:
        Timeout: PT3M

However, the update (and even the initial creation) fails:

Timestamp | Logical ID | Status | Status reason
-- | -- | -- | --
2021-01-28 13:49:40 UTC+0100 | myteststack2 | ROLLBACK_IN_PROGRESS | The following resource(s) failed to create: [ec2idC85D4938b6c9a12994f1bfa6]. Rollback requested by user.
2021-01-28 13:49:39 UTC+0100 | ec2idC85D4938b6c9a12994f1bfa6 | CREATE_FAILED | Failed to receive 1 resource signal(s) within the specified duration
2021-01-28 13:46:07 UTC+0100 | ec2idC85D4938b6c9a12994f1bfa6 | CREATE_IN_PROGRESS | Resource creation Initiated

What did you expect to happen?

Creation / update should also work with this setup.

What actually happened?

CF doesn't receive the cfn-signal sent from the newly created EC2 instance. Perhaps this is related to the "manipulated" logical IDs generated in this case?

Environment

CDK CLI Version : 1.85.0
Framework Version:
Node.js Version: v12.18.4
OS : Amazon Linux 2
Language (Version): Python 3.7.7

Other

n.a.

This is 🐛 Bug Report

rix0rrr · 2021-02-08T14:07:42Z

> perhaps this relates to the "manipulated" logical IDs generated in this case?

This could very well be the case. You should be able to determine that from the template: check whether the logical ID of the instance and the logical ID in the UserData match.
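
The comparison suggested above can be scripted against a synthesized template. The following is only a rough sketch: the helper names (`flatten_user_data`, `signalled_resource`, `check_template`) and the sample template are hypothetical, and it assumes the `--resource` argument appears as a literal string inside the `Fn::Join` parts of the UserData.

```python
import re
from typing import Any, Dict, List, Optional


def flatten_user_data(value: Any) -> str:
    """Concatenate the literal string pieces of a UserData value."""
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        if "Fn::Base64" in value:
            return flatten_user_data(value["Fn::Base64"])
        if "Fn::Join" in value:
            separator, parts = value["Fn::Join"]
            return separator.join(flatten_user_data(p) for p in parts)
        # Ref, Fn::GetAtt, ...: not a literal, so use a placeholder.
        return "<intrinsic>"
    return ""


def signalled_resource(user_data: Any) -> Optional[str]:
    """Return the value passed to `cfn-signal --resource`, if one is found."""
    text = flatten_user_data(user_data)
    match = re.search(r"cfn-signal\b.*?--resource\s+(\S+)", text)
    return match.group(1) if match else None


def check_template(template: Dict[str, Any]) -> List[str]:
    """List EC2 instances whose cfn-signal targets a different logical ID."""
    problems = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::EC2::Instance":
            continue
        user_data = resource.get("Properties", {}).get("UserData")
        target = signalled_resource(user_data)
        # Skip instances without a signal or with a non-literal target.
        if target is None or target == "<intrinsic>":
            continue
        if target != logical_id:
            problems.append(f"{logical_id}: cfn-signal targets {target}")
    return problems


# Hypothetical template fragment, shaped like a synthesized stack.
sample = {
    "Resources": {
        "ec2idC85D4938b6c9a12994f1bfa6": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "UserData": {
                    "Fn::Base64": {
                        "Fn::Join": ["", [
                            "#!/bin/bash\n/opt/aws/bin/cfn-signal --stack myteststack2"
                            " --resource ec2idC85D4938 --region ",
                            {"Ref": "AWS::Region"},
                            " -e $exitCode\n",
                        ]]
                    }
                }
            },
        }
    }
}

problems = check_template(sample)
print(problems)
```

In practice the dictionary would be loaded with `json.load` from the template file written by `cdk synth`; the inline sample only keeps the sketch self-contained.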

ilko-rbi · 2021-02-25T15:20:53Z

Hi, sorry for the long delay. Yes, the logical IDs match. The next assumption would be that CloudFormation is confused by this, or that we are hitting some bug. If someone can share more details, I can open a support request with AWS.

koshic · 2021-11-22T19:09:26Z

@ilko-rbi, it's a bug.

1. 'user_data_causes_replacement' leads to a lazy (which is important) logicalId update:

  const digest = md5.digest('hex').substr(0, 16);
  return `${originalLogicalId}${digest}`; // SomeInstance1236HASH

2. UserData reads the logicalId and builds the command that sends the signal inside the addSignalOnExitCommand method:

  const resourceID = stack.getLogicalId(resource.node.defaultChild); // still SomeInstance1236, because of the lazy update in #1
  this.addOnExitCommands(`cfn-signal --stack ${stack.stackName} --resource ${resourceID} --region ${stack.region} --success ($success.ToString().ToLower())`);

So the cfn-signal script contains the obsolete logicalId without the hash part (SomeInstance1236 in this example), and CF never learns that SomeInstance1236HASH is ready.
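
The mechanism described above can be mimicked without CDK. This is a simplified model, not the framework's code: `FakeInstance` and `hashed_id` are hypothetical names, and the digest is computed over the user data lines purely for illustration.

```python
import hashlib
from typing import Callable, List, Optional


class FakeInstance:
    """Hypothetical stand-in for a resource with a lazily overridden logical ID."""

    def __init__(self, allocated_id: str) -> None:
        self.allocated_id = allocated_id
        self.user_data: List[str] = []
        self._override: Optional[Callable[[], str]] = None

    def override_logical_id(self, producer: Callable[[], str]) -> None:
        # The producer is stored and only called when the ID is resolved.
        self._override = producer

    def final_logical_id(self) -> str:
        return self._override() if self._override else self.allocated_id


def hashed_id(instance: FakeInstance) -> str:
    """Append a 16-character digest of the user data to the allocated ID."""
    payload = "\n".join(instance.user_data).encode()
    digest = hashlib.md5(payload).hexdigest()[:16]
    return f"{instance.allocated_id}{digest}"


instance = FakeInstance("SomeInstance1236")
instance.override_logical_id(lambda: hashed_id(instance))

# Eager read that ignores the override, as in step 2 of the comment.
stale_id = instance.allocated_id
instance.user_data.append(f"cfn-signal --resource {stale_id}")

# Resolution at the end of synthesis: the hash is appended only now.
final_id = instance.final_logical_id()
print(stale_id)
print(final_id)
```

The signal command keeps the short ID while the resource ends up with the hashed one, which is the mismatch that makes the CreationPolicy wait until the timeout.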

Workaround: make a similar call to addOnExitCommands manually, but get the resource ID via '(instance.node.defaultChild as CfnInstance).logicalId'. It works for me and renders with the full logicalId.
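
For the Python stack from the report, the workaround might look roughly like the sketch below. `build_signal_command` is a hypothetical helper, and the command text (including the `$exitCode` variable set by the generated exit trap) is an assumption about what the Linux user data normally emits; compare it with your synthesized template before relying on it.

```python
def build_signal_command(stack_name: str, logical_id: str, region: str) -> str:
    """Build a cfn-signal call for Linux user data (assumed command shape)."""
    return (
        f"/opt/aws/bin/cfn-signal --stack {stack_name} "
        f"--resource {logical_id} --region {region} "
        "-e $exitCode || echo 'Failed to send Cloudformation Signal'"
    )


# Plain-string demonstration with the IDs from the failed deployment above.
command = build_signal_command(
    "myteststack2", "ec2idC85D4938b6c9a12994f1bfa6", "eu-central-1"
)
print(command)

# Inside the stack (requires aws_cdk, so it is not executed here), this would
# replace ec2.user_data.add_signal_on_exit_command(ec2):
#
#   cfn_instance = ec2.node.default_child  # the underlying CfnInstance
#   ec2.user_data.add_on_exit_commands(
#       build_signal_command(self.stack_name, cfn_instance.logical_id, self.region)
#   )
```

The point of the design is that `logical_id` is read from the resource itself, so it is resolved after the override is applied, instead of going through the stack's ID lookup.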

…ion with `userDataCausesReplacement` (#18726) If both `addSignalOnExitCommand` _and_ `userDataCausesReplacement` are used it results in an invalid logicalId being used in the `cfn-signal` call. This is due to `addSignalOnExitCommand` getting the logicalID from `Stack.getLogicalId` which does not take into consideration logicalId overrides which `userDataCausesReplacement` uses. This updates `addSignalOnExitCommand` to use the `logicalId` of the resource which is evaluated lazily and happens after all overrides. fixes #12749 ---- *By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*

github-actions · 2022-01-31T13:46:53Z

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

ilko-rbi added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 28, 2021

github-actions bot added the @aws-cdk/aws-ec2 Related to Amazon Elastic Compute Cloud label Jan 28, 2021

github-actions bot assigned rix0rrr Jan 28, 2021

rix0rrr added effort/small Small work item – less than a day of effort p1 labels Feb 8, 2021

ryparker removed the needs-triage This issue or PR still needs to be triaged. label Jun 1, 2021

ericzbeard unassigned rix0rrr Jun 17, 2021

This was referenced Jan 28, 2022

fix(ec2): UserData.addSignalOnExitCommand does not work in combination with userDataCausesReplacement #18726

Merged

(ec2): addSignalOnExit and userDataCausesReplacement does not work together #11959

Closed

mergify bot closed this as completed in #18726 Jan 31, 2022
