Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure route53 records really exist after creation and add support for all ASCII characters #4183

Merged
merged 4 commits into from
Apr 12, 2018

Conversation

Crazybus
Copy link
Contributor

@Crazybus Crazybus commented Apr 12, 2018

In #3579 we found some weird behaviour where records starting with @ were always showing up as a change for every terrafrom apply. The weird part about this was that the record was being created correctly and not showing up in state. In #3579 (comment) @bflad pointed out that @ doesn't need to be used with AWS and they actually give you a warning when you try to use it inside the AWS console. With this answer I was happy and able to move these @ records to the apex domain and move on with my life. However I was still very curious why the records weren't showing up in state properly.

Fast forward a few days and I submitted a job to move the values of the @ records to the apex domains for each zone. It worked! However a little while later someone reported an unrelated DNS record was no longer resolving. I had only just recently moved all of our production DNS into terraform so I was worried that the automation for deploying changes had accidentally deleted something. In #3579 (comment) you can see more detail about what actually happened.

Luckily all DNS changes are now automated by a jenkins job running terraform. Looking at the Route53 logs I could see that the unrelated record was removed exactly when terraform was removing the @ record. Another terraform run fixed things up and added the record back but I felt that I needed to figure out what happened. Luckily I was able to reproduce this locally however I still hadn't explained why it only happened for 1 of the 3 zones which had @ records.

Findings

How did a random record get deleted?

I figured out very quickly that this was only happening for zones which had more than 100 records (the default maxItems per page for the AWS api). I noticed this when I tried to reproduce it with a limited dataset and couldn't trigger the issue. So my focus shifted to the code which was looping of the paginated results from AWS.

The root cause was the line https://github.com/terraform-providers/terraform-provider-aws/blob/c77020f539eed9d3f1dab36c2d2d5a624ed3007b/aws/resource_aws_route53_record.go#L646

So the current record in the response was being compared to the *lopts.StartRecordName. Which for the first page of results was the correct record. However if the first result didn't match for some reason (In my case because @.domain.com != \\100.domain.com it would continue looping until it got to the next page. At this stage *lopts.StartRecordName now was set to the first result in the second page of results and then the record matched perfectly!

So this bug would only be triggered if the record being deleted didn't match with how AWS returned the record, and if there were more than 100 records in the zone.

P.S. I tried to write an acceptance test for this behaviour but with my other fixes I now can't trigger a scenario where terraform can create a record it can't find.

Why didn't terraform give an error that it couldn't find the record after creating it?

The above wouldn't have been possible if terraform had correctly detected that it couldn't read the record it had created. When creating and updating records the final line is https://github.com/terraform-providers/terraform-provider-aws/blob/c77020f539eed9d3f1dab36c2d2d5a624ed3007b/aws/resource_aws_route53_record.go#L454

Since this function is also used for checking if a record exists or not it silently will drop the record and remove it from state if the record doesn't exist.

https://github.com/terraform-providers/terraform-provider-aws/blob/c77020f539eed9d3f1dab36c2d2d5a624ed3007b/aws/resource_aws_route53_record.go#L520-L530

So what was happening is terraform was trying to create @.domain.com, after successfully creating it then called resourceAwsRoute53RecordRead which didn't return an error so Terraform didn't give an error.

Other improvements

Maxitems

My first reaction when seeing the looping over the first 100 results was....why? I then discovered the reason and added it as an inline comment so nobody else gets confused in the future. I have now optimised this to only return a single result if we know that there can be only one match. This saves terraform needing to pull in 100 resources everytime it wants to check if a single record exists.

Escaped ASCII characters

AWS supports a lot more special characters apart from * and @ which I have now added support for. You can see a full list at the Route53 Domain Name Format docs. I decided to just convert all of them instead of trying to maintain a list of supported characters. If a user does somehow pick a character that doesn't convert correctly they will now correctly receive an error when terraform tries to read back the record after creating it.

verifying if a record exists after creating/updating since
resourceAwsRoute53RecordRead does not return an error if the record does
not exist
to avoid downloading 100 records for every request
Compare the response records with the original record name instead of
using the StartRecordName. In events where the record wasn't matching
correctly StartRecordName would be updated on the next page and it would
then always match the next record (record 101) leading to accidental
deletions of the wrong record
@ghost ghost added the size/M Managed by automation to categorize the size of a PR. label Apr 12, 2018
@bflad bflad added bug Addresses a defect in current functionality. service/route53 Issues and PRs that pertain to the route53 service. labels Apr 12, 2018
@bflad bflad added this to the v1.15.0 milestone Apr 12, 2018
Copy link
Contributor

@bflad bflad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Crazybus thank you very much for the detailed comments and logic enhancement here. Both are extremely appreciated! 🚀

24 tests passed (all tests)
=== RUN   TestAccAWSRoute53Record_alias
--- PASS: TestAccAWSRoute53Record_alias (157.67s)
=== RUN   TestAccAWSRoute53Record_txtSupport
--- PASS: TestAccAWSRoute53Record_txtSupport (163.68s)
=== RUN   TestAccAWSRoute53Record_importUnderscored
--- PASS: TestAccAWSRoute53Record_importUnderscored (182.63s)
=== RUN   TestAccAWSRoute53Record_basic
--- PASS: TestAccAWSRoute53Record_basic (183.94s)
=== RUN   TestAccAWSRoute53Record_caaSupport
--- PASS: TestAccAWSRoute53Record_caaSupport (188.77s)
=== RUN   TestAccAWSRoute53Record_failover
--- PASS: TestAccAWSRoute53Record_failover (191.23s)
=== RUN   TestAccAWSRoute53Record_s3_alias
--- PASS: TestAccAWSRoute53Record_s3_alias (193.12s)
=== RUN   TestAccAWSRoute53Record_generatesSuffix
--- PASS: TestAccAWSRoute53Record_generatesSuffix (221.00s)
=== RUN   TestAccAWSRoute53Record_TypeChange
--- PASS: TestAccAWSRoute53Record_TypeChange (233.38s)
=== RUN   TestAccAWSRoute53Record_spfSupport
--- PASS: TestAccAWSRoute53Record_spfSupport (235.53s)
=== RUN   TestAccAWSRoute53Record_latency_basic
--- PASS: TestAccAWSRoute53Record_latency_basic (236.09s)
=== RUN   TestAccAWSRoute53Record_geolocation_basic
--- PASS: TestAccAWSRoute53Record_geolocation_basic (237.92s)
=== RUN   TestAccAWSRoute53Record_weighted_basic
--- PASS: TestAccAWSRoute53Record_weighted_basic (241.35s)
=== RUN   TestAccAWSRoute53Record_AliasChange
--- PASS: TestAccAWSRoute53Record_AliasChange (250.92s)
=== RUN   TestAccAWSRoute53Record_wildcard
--- PASS: TestAccAWSRoute53Record_wildcard (260.54s)
=== RUN   TestAccAWSRoute53Record_aliasUppercase
--- PASS: TestAccAWSRoute53Record_aliasUppercase (271.41s)
=== RUN   TestAccAWSRoute53Record_SetIdentiferChange
--- PASS: TestAccAWSRoute53Record_SetIdentiferChange (277.56s)
=== RUN   TestAccAWSRoute53Record_basic_fqdn
--- PASS: TestAccAWSRoute53Record_basic_fqdn (277.91s)
=== RUN   TestAccAWSRoute53Record_importBasic
--- PASS: TestAccAWSRoute53Record_importBasic (279.52s)
=== RUN   TestAccAWSRoute53Record_empty
--- PASS: TestAccAWSRoute53Record_empty (152.85s)
=== RUN   TestAccAWSRoute53Record_multivalue_answer_basic
--- PASS: TestAccAWSRoute53Record_multivalue_answer_basic (161.80s)
=== RUN   TestAccAWSRoute53Record_longTXTrecord
--- PASS: TestAccAWSRoute53Record_longTXTrecord (182.12s)
=== RUN   TestAccAWSRoute53Record_weighted_alias
--- PASS: TestAccAWSRoute53Record_weighted_alias (385.36s)
=== RUN   TestAccAWSRoute53Record_allowOverwrite
--- PASS: TestAccAWSRoute53Record_allowOverwrite (206.01s)

}

log.Printf("[DEBUG] List resource records sets for zone: %s, opts: %s",
zone, lopts)

var record *route53.ResourceRecordSet

// We need to loop over all records starting from the record we are looking for because
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ thank you!

name := cleanRecordName(*recordSet.Name)
if FQDN(strings.ToLower(name)) != FQDN(strings.ToLower(*lopts.StartRecordName)) {

responseName := strings.ToLower(cleanRecordName(*recordSet.Name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nitpick: To prevent crashes in the very unlikely scenario that AWS returns nil for either of these, it would be safer to wrap them in aws.StringValue() 👍 e.g. aws.StringValue(recordSet.Name)

@bflad bflad merged commit 9532212 into hashicorp:master Apr 12, 2018
bflad added a commit that referenced this pull request Apr 12, 2018
@bflad
Copy link
Contributor

bflad commented Apr 18, 2018

This has been released in version 1.15.0 of the AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

@ghost
Copy link

ghost commented Apr 6, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Apr 6, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Addresses a defect in current functionality. service/route53 Issues and PRs that pertain to the route53 service. size/M Managed by automation to categorize the size of a PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants