Hacker News
AWS Tagging Best Practices (cloudforecast.io)
129 points by toeknee123 on Aug 12, 2020 | 36 comments



One really useful tool: set up a tag policy in your account / organization and make sure that all of your project teams have the ability to view it, so finding non-compliant resources can be self-service without having a human in the loop:

https://docs.aws.amazon.com/organizations/latest/userguide/o...


You would think the most obvious feature of something called "tag policies" would be the ability to prevent creation of resources that violate the policy. Of course, due to the way the AWS APIs work, it can't do this. Such a disappointment in practice.

I do agree it is still useful in the capacity you described.


You can do this with enforced_for properties

https://docs.aws.amazon.com/organizations/latest/userguide/o...
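For reference, a minimal tag policy using enforced_for might look something like this sketch (syntax follows the tag policy grammar with its @@assign operator; the key, values, and resource types are illustrative):

```json
{
  "tags": {
    "costcenter": {
      "tag_key": { "@@assign": "CostCenter" },
      "tag_value": { "@@assign": ["100", "200"] },
      "enforced_for": { "@@assign": ["ec2:instance", "s3:bucket"] }
    }
  }
}
```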


This only applies to tagging operations, which are often separate from resource creation operations. For example, you can require a particular tag on an S3 bucket, but you can still just create a completely untagged bucket. The policy doesn't come into play unless you try to tag the bucket.


There are an increasing number of APIs where you can do this with IAM conditions but it’s definitely harder than it should be.

The big thing is that you need to be comfortable breaking anything you use until someone can update it (maybe forking a third-party IaC project or vendor stack, etc.), so it's definitely less friction to remediate afterwards.
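As a sketch of the IAM-conditions approach, one common shape is a Deny on ec2:RunInstances when the request carries no CostCenter tag (the tag key is illustrative, and not every API supports aws:RequestTag at creation time, which is exactly the friction described above):

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyUntaggedRunInstances",
    "Effect": "Deny",
    "Action": "ec2:RunInstances",
    "Resource": "arn:aws:ec2:*:*:instance/*",
    "Condition": { "Null": { "aws:RequestTag/CostCenter": "true" } }
  }]
}
```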


I'm sure over time they will retrofit their APIs to better support tag policy enforcement. I am glad they err on the side of maintaining a stable API, so I can see why things are the way they are.

For now, we enforce our policy during code review. Everything is deployed via terraform/terragrunt so it is pretty easy. We have plans to mostly automate this via static code analysis.
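A static check like that can run over `terraform show -json` output; this sketch (the required tag keys are an example policy, and the plan shape is assumed from the published plan JSON format) lists resources missing required tags:

```python
REQUIRED_TAGS = {"Owner", "CostCenter", "Environment"}  # example policy, adjust to taste

def untagged_resources(plan: dict) -> list:
    """Scan a `terraform show -json` plan for resources missing required tags.

    Walks planned_values.root_module.resources; only the top-level 'tags'
    attribute is checked (which most taggable AWS resources expose)."""
    problems = []
    resources = plan.get("planned_values", {}).get("root_module", {}).get("resources", [])
    for res in resources:
        tags = (res.get("values") or {}).get("tags") or {}
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            problems.append((res.get("address"), sorted(missing)))
    return problems
```

Wiring this into CI as a plan-time gate fails the build before anything untagged is applied.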


Ditto — I think tools like Sentinel policies are a great answer here because it doesn't prevent your administrators from doing something in a hurry but it means that your normal flow of releases will catch missed tags. We're already using that for things like tflint/tfsec/Checkov/etc. so it's a familiar workflow.


One thing to watch out for, since they're managed under Organizations you might expect that they inherit in the same way that Service Control Policies do... and they kind of do (in that a tag policy attached to a root or OU applies to every downstream OU and account), but these policies have "value-setting operators" and "child control operators" [0] that let downstream tag policies carve out exceptions in a way that I haven't quite wrapped my head around yet.

SCPs are conceptually very simple and easy to reason about. Not so with tag policies (and, I guess, "backup policies" or "Artificial Intelligence (AI) services opt-out policies", which are both new since the last time I looked at this documentation).

[0]


We have gates so you can't deploy an instance without requisite tags for the instance and volume.


That’s what I recommend as well but I see this as a situation where you want to do both: put checks into your normal processes but unless you’re able to have 100% consistency across every team and account, you deploy tag policies to ensure that you catch manual work, experiments, vendor-provided stacks, etc. Tag policies cover all resources which support tagging so it’ll include things like those IAM roles you created for security monitoring outside of your normal development flow, etc.


One very useful tool for tagging and tag governance is https://github.com/cloud-custodian/cloud-custodian. You can set up policies to auto-tag resources with their creators, enforce tag policies, enforce valid vocabularies, and work through remediation and notification workflows. There's even a standalone tool to do retroactive tagging of resource creators via CloudTrail querying (c7n_trailcreator).
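For a flavor of what those policies look like, here's a minimal sketch that flags EC2 instances missing an Owner tag and schedules a delayed stop (tag names and the four-day grace period are illustrative):

```yaml
policies:
  - name: ec2-missing-owner
    resource: ec2
    filters:
      - "tag:Owner": absent
    actions:
      # mark-for-op tags the instance now and lets a later policy
      # perform the operation after the grace period expires
      - type: mark-for-op
        op: stop
        days: 4
```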

disclaimer: maintainer on cloud custodian


Hey! Cloud Custodian is such an amazing tool! Part 3 of this series will focus on auditing and finding mistagged resources, and Cloud Custodian was already on our list to look at. Thanks for pointing me to c7n_trailcreator.

disclaimer: technical co-founder at CloudForecast


One thing that I wish I'd known when I first started to use Cost Allocation Tags in AWS as described in the article -- they won't apply to any resources created before the tag was created & activated. Well, you can apply them but they won't return data to your cost reports. For some resources this is not a big deal -- any individual EC2 instance I can respawn pretty easily, or I'm not doing my job right. But there's lots of things in the accounts I administrate that would be far less trivial to re-create, due as much to organizational issues as technical ones. Being able to apply CATs to old resources and start getting cost data back would be a huge improvement for my day-to-day work.


> Being able to apply CATs to old resources and start getting cost data back would be a huge improvement for my day-to-day work

For what services can't you do this? I recently made a script that iterated over all DynamoDB Tables, Lambdas, SQS, Kinesis Firehose/Analytics/Streams, S3 Buckets, etc. and added a tag to track cost on the individual services.

The only thing I can think of is that you will not have any data from before the date you added the tag.


For any service -- say EC2:

  - launch instance A  
  - create a new CAT and activate it in the Billing console 
  - apply CAT to instance A 
  - get no cost data back for that instance in Cost Explorer filtering on that CAT 
  - terminate instance A 
  - launch instance B 
  - apply CAT to instance B 
  - get cost data back for that instance in Cost Explorer filtering on that CAT
  
It was explained to me by AWS support that this is in fact how it works, after I opened a case into why I wasn't getting full data back for some legacy stuff. So for any new account I would say that your tagging scheme should get built first, before infrastructure or anything else.


Yeah - their complaint, albeit ambiguously stated, seemed to be that you couldn't retroactively allocate historical charges to a newly placed tag.


Sorry for the ambiguity -- but no, I mean getting historical cost data retroactive to the resource creation would be great, but I can understand why that's not possible. What I'm saying is that Cost Allocation Tags specifically won't return any cost data for resources that are older than that CAT's activation date. So not only do I not get past costs but I don't get anything going forward. I went on a tagging spree a while back and then spent a bunch of time later down the road trying to figure out why it was getting info for some things and not others.


Is this an EC2 thing? I've definitely added a new tag to a two year old DynamoDB table and started getting data.


Is that how it works? It's been a long while since I first set up tags for Cost Explorer, but I'm fairly certain it must have applied to, say, pre-existing EC2 instances and so on or else I never would have had it working...


I emphatically agree that tagging should be automated. Terraform, CloudFormation, Pulumi, whatever - make it part of the code. It's way too easy for someone to miss a tag or misspell a tag if it's being done manually.


100% agree. We've seen some crazy typos and other things while helping our customers clean up their tagging. We're developing the Part 2 guide right now, which will cover Terraform and CloudFormation.
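As a sketch of the usual Terraform pattern for this, a shared tag map merged into each resource keeps tags in code instead of in people's heads (tag names, values, and the AMI ID here are placeholders):

```hcl
locals {
  common_tags = {
    Environment = "prod"
    Owner       = "platform-team"
    ManagedBy   = "terraform"
  }
}

resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890" # placeholder AMI ID
  instance_type = "t3.micro"

  # merge() lets a resource add or override tags without dropping the common set
  tags = merge(local.common_tags, { Name = "web" })
}
```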


Absolutely, manual tagging is a plague. In large enterprises I've seen manual tagging efforts that result in massive gaps and confusion about who owns what, and whatever else is being done with tags.


Someone's job relies on that automation not existing (half kidding).


Well they think that at least


One of our teams set up a remediation service that does realtime tag analysis by following the Cloudtrail API logs. Anyone at the company who knows about the remediation service can simply log in, filter by aws resource type, and find a note indicating either a better configuration, or why the resource just got nuked from a high orbit.

We don't have any issues tracking stuff down - if you create something that can't be traced back to your department, it explodes. I've never seen anything more effective at steering user configuration.

Following through on a threat is essentially magic.


Can you elaborate more on this? Is there any reading I can do online about this topic? Does the service basically just scan to see that all services are tagged with enumerated department names?


You can use CloudWatch Events for this: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/E....
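A sketch of the handler side of that: inspect a CloudTrail-style event delivered via CloudWatch Events for a RunInstances call and report missing tag keys. The required key is hypothetical, and the exact field names under requestParameters are an assumption worth verifying against your own trail:

```python
REQUIRED_TAG_KEYS = {"Department"}  # hypothetical required key

def check_event(event: dict) -> list:
    """Return required tag keys missing from a RunInstances CloudTrail event.

    CloudTrail events arrive wrapped in a 'detail' object; RunInstances
    carries launch-time tags under requestParameters.tagSpecificationSet."""
    detail = event.get("detail", {})
    if detail.get("eventName") != "RunInstances":
        return []  # not a launch event; nothing to check
    specs = ((detail.get("requestParameters") or {})
             .get("tagSpecificationSet", {})
             .get("items", []))
    tag_keys = {t.get("key") for spec in specs for t in spec.get("tags", [])}
    return sorted(REQUIRED_TAG_KEYS - tag_keys)
```

A real remediation service would then notify the owner or terminate the instance when the returned list is non-empty.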


A couple months ago I released automation in our cloud environment to shut down/start up resources that have been tagged with cron expressions. It has been an incredibly flexible solution for managing and reducing the cost of resources that charge per time spent operating.

Not only do tags fit well into existing workflows (CI/CD, policy); cron expressions give devs/sysadmins the freedom to choose whichever schedule they need. This solution also lets a central IT team easily query and calculate the duty cycle of any tagged resource, and relate that calculation to money saved.

I suggest using cron expression tagging for any scheduled resource automation, as it will likely be more flexible and more easily monitored than anything natively offered by the cloud vendor.


I know cron (duh) but I've been unable to google "cron expression tagging". Could you please provide a literal example? Is it something like a tag "startup=0 1 * * * " or tag "shutdown=15 * * * * "?


I assume this would be a custom implementation.

For example, tag an EC2 instance with "CronStart: 0 8 * * *" and "CronStop: 0 17 * * *". Then have a service that queries EC2 instances with CronStart/CronStop tags and starts/stops the instances according to these schedules.

I haven't looked for any service that does this out of the box, but it would not be particularly complicated to implement on e.g. AWS Lambda.
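A minimal sketch of that matcher, assuming hypothetical CronStart/CronStop tag keys and only bare-bones cron support (`*`, numbers, and comma lists; note that cron's day-of-week uses 0=Sunday while Python's weekday() uses 0=Monday, so a real implementation would remap it):

```python
from datetime import datetime

def field_matches(field: str, value: int) -> bool:
    """Match one cron field against a value; supports '*', numbers, comma lists."""
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}

def cron_matches(expr: str, when: datetime) -> bool:
    """True if a 5-field cron expression (minute hour dom month dow) matches `when`."""
    minute, hour, dom, month, dow = expr.split()
    return (field_matches(minute, when.minute)
            and field_matches(hour, when.hour)
            and field_matches(dom, when.day)
            and field_matches(month, when.month)
            and field_matches(dow, when.weekday()))  # caveat: weekday numbering differs from cron

def desired_action(tags: dict, when: datetime):
    """Given a resource's tags, decide whether the scheduler should act right now."""
    if "CronStart" in tags and cron_matches(tags["CronStart"], when):
        return "start"
    if "CronStop" in tags and cron_matches(tags["CronStop"], when):
        return "stop"
    return None
```

A Lambda on a one-minute schedule would then call the relevant start/stop API for each tagged resource whose expression matches.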


fwiw, cloud custodian also supports this (offhours) across ec2, asg, rds, aurora, etc. via a tag spec for off hours. note though that rds services have much more restrictive tag values.

disclaimer: maintainer on cloud custodian.


This doesn't cover other interesting uses, like tag-based automation. Random examples: Tagging DynamoDB tables to identify which should be backed up and at which frequency (when you don't quite trust the built-in backup); tagging dev RDS databases with a shut-down schedule for nights/week-ends; tagging Elastic IPs and Auto Scaling Groups with a "IP pool ID", and a Lambda that re-assigns EIPs to ASG instances as they are recycled; using a "data flow ID" tag on resources that are in the hot-path of data flows that are subject to high-volume bursts, so you can easily list them and scale them up before known events.


One pattern I like is having a tag for security groups indicating that they should accept traffic from a CDN or other partner service which a scheduled Lambda function will periodically update from a canonical list of CIDR ranges. This makes it really easy to avoid people leaving origins open by mistake since you can still have a blanket ban on 0.0.0.0/0 rules.

These days I think you can use the new customer managed IP prefix list feature they added last month for this specific need so this approach could be simplified if you need to share the same ranges across accounts:

https://docs.aws.amazon.com/vpc/latest/userguide/sharing-man...
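The core of such a Lambda is just a set diff between the canonical CIDR list and the group's current rules; this sketch computes the changes and leaves the actual authorize/revoke API calls (and fetching the canonical list, e.g. from a published ip-ranges feed) to the caller:

```python
def rule_changes(current_cidrs, canonical_cidrs) -> dict:
    """Diff a security group's current ingress CIDRs against a canonical list.

    Returns the CIDRs to authorize (in canonical but not current) and to
    revoke (in current but not canonical), sorted for stable output."""
    current, canonical = set(current_cidrs), set(canonical_cidrs)
    return {
        "authorize": sorted(canonical - current),
        "revoke": sorted(current - canonical),
    }
```

Running this idempotently on a schedule converges every group toward the canonical list, which is what makes the blanket ban on 0.0.0.0/0 workable.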


Those are really practical and interesting use cases for tags that we should definitely cover more of. Let me note this and we'll be sure to cover more in-depth uses as we develop more parts to this guide! Appreciate the feedback.


I can echo what this article recommends for business tags. When we migrated to IaC (terraform) we tagged nearly all of our resources and set up quite a few budgets and budget overrun alerts.

It gives great insight into infrastructure costs, makes budgeting a lot easier, and saved us a decent chunk of change by letting us know where savings would have the most impact or alerting us when we had unintended cost increases.


I hear ya on that! We've found that a lot of orgs want to go straight to "how do we save money?" But without the visibility you get from tagging and organizing your costs with tags, where do you even start?



