AWS Cost Optimization: How to Reduce Your Bills Before They Spiral

Cloud costs are a notoriously complex topic. Not from a technical point of view, but more so from an organizational standpoint. Most discussions about cloud spending happen after a certain “threshold” is reached. This usually happens after one or more stakeholders realize that cloud spending has gotten out of control. The realization is often based on a “gut feeling” that what you’re running shouldn’t be that expensive. Or the business is not doing well and is trying to actively cut costs. A mandate for cost control is put in place and IT infrastructure is the most logical layer to start with.

Point is, costs usually only get attention once they become problematic. That mindset is dangerous for two reasons. First, it gives up the benefits of a cost-aware engineering culture, which keeps work focused on revenue-generating efforts. Teams that think about cost while building tend to make better architectural decisions, not just cheaper ones. Second, retrofitting infrastructure for cost reasons is tedious and complex. Rearchitecting a data pipeline or migrating a database to a cheaper storage tier after the fact is significantly harder than designing for it from the start.

In this article we will explore why this mindset is harmful and how you can get a grip on your AWS bills with a more proactive approach.

Key Takeaways

You pay for what exists, not just what runs: idle resources, reserved IPs, and forgotten snapshots all accrue charges
Every AWS charge combines three primitives: time-based, volume-based, and feature-based
Cost attribution is impossible without a tagging strategy in place from day one
AWS Data Exports, queried via Athena, is the right tool for workload-level cost analysis
For most data workloads, DuckDB or Athena will cost significantly less than a Spark or EMR cluster

How AWS Actually Charges You

Looking at your AWS bill can be confusing. You see a list of cloud services, each with a dollar amount next to it that reflects what the service cost. The bill usually doesn’t tell you which specific resources and workloads in your infrastructure are causing those costs.

Let’s have a look at how AWS charges you and what mental model is required to approach this topic the right way.

The mental model of cloud costs

You pay for what exists, not just for what runs. In traditional data centers you buy hardware with a specific capacity up front and then it sits there as a capital expense. With cloud providers like AWS, every resource you create starts a meter running.

You spun up a database for testing purposes six months ago and forgot about it? You will still pay for that resource even if it has not served a single query. The same goes for other resources: IP addresses that are reserved but not attached to anything cost money. Snapshots from a migration that completed last year are still accumulating charges.

AWS does not clean up after you, and if you don’t, it shows up in your bill.

What are the three AWS billing primitives?

Everything on your AWS bill reduces to a small set of primitives.

Time-based charges are the most common. A resource exists and for every unit of time it continues to exist, a charge accrues. The unit might be a second, a minute, or an hour depending on the service. Once you create the resource, a clock starts, and it stops only when you delete the resource.

Volume-based charges apply once you consume a measurable quantity: gigabytes of data transferred, number of API requests made, Lambda invocations. These scale with consumption and occur only when something actually happens, for example when you download a file from S3 or invoke a Lambda function. Many services use tiered pricing rather than a flat per-unit rate, so the cost per unit often decreases at higher volumes.

Feature-based charges accrue once you enable a specific capability on top of a base resource. Enabling multi-AZ replication, enhanced monitoring, or database performance monitoring (via CloudWatch Database Insights on RDS) each add a cost layer on top of the underlying resource cost.

Figure: A single RDS instance combines all three primitives at once: time-based instance hours, volume-based I/O, and feature-based charges for multi-AZ or enhanced monitoring.
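To make the three primitives concrete, here is a minimal Terraform sketch of such an RDS instance with each billable setting annotated. The identifier, sizes, password variable, and the referenced IAM role are assumptions for illustration, not values from a real setup.

```hcl
# Illustrative sketch: names, sizes, and the referenced role are assumptions.
variable "db_password" {
  type      = string
  sensitive = true
}

resource "aws_db_instance" "orders" {
  identifier = "orders-db"
  engine     = "postgres"
  username   = "app"
  password   = var.db_password

  # Time-based: billed for every hour the instance exists, queried or not.
  instance_class = "db.t4g.medium"

  # Also time-based: provisioned storage is billed per GB-month,
  # regardless of how much of it actually holds data.
  allocated_storage = 100

  # Feature-based: each of these can add a charge on top of the base instance.
  multi_az                     = true                            # standby in a second AZ
  monitoring_interval          = 60                              # enhanced monitoring, in seconds
  monitoring_role_arn          = aws_iam_role.rds_monitoring.arn # assumed to exist
  performance_insights_enabled = true                            # database performance monitoring

  # Volume-based charges (for example data transfer out, or backup storage
  # beyond the included allotment) have no argument here; they accrue with usage.
  skip_final_snapshot = true
}
```

Reading a resource definition this way, argument by argument, is a quick check of which meters a new resource will start.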

What are the three dimensions of AWS workload costs?

Beyond the billing primitives, every workload you create generates costs across three independent dimensions. The key word is independent: optimizing one doesn’t automatically reduce the others.

Figure: The three cost dimensions are independently present on every workload. Reducing one doesn’t reduce the others.

Compute is what you pay every time something runs: a server processing requests, a Lambda function executing, a Redshift cluster scanning rows. It maps directly to the resources you provision and is the easiest dimension to act on. Autoscaling, right-sizing, moving to spot instances, switching to serverless: all of these have an immediate, measurable effect. The risk is that compute gets so much attention it crowds out the other two dimensions entirely.

Storage represents data at rest: databases, S3 objects, EBS volumes, RDS snapshots, CloudWatch logs, Glacier archives. Cloud providers have made storage feel like a commodity, and it is cheap per unit, until you start counting units. Data accumulates silently. Logs grow unchecked, snapshots multiply across accounts, old backups never get pruned because nobody owns the cleanup process. What makes storage particularly deceptive is that costs are completely decoupled from access patterns. Archived data you haven’t touched in three years costs the same as data queried constantly, unless you explicitly move it to a cheaper tier. The cloud won’t do this for you.
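Unbounded log growth is one of the easier storage leaks to close. As a minimal sketch, assuming a hypothetical log group name and a 30-day retention requirement, a CloudWatch log group can be given an explicit retention period in Terraform:

```hcl
# Log group name and retention period are illustrative assumptions.
resource "aws_cloudwatch_log_group" "ingestion" {
  name = "/ecs/ingestion"

  # Without retention_in_days, log events are kept indefinitely
  # and the stored volume only ever grows.
  retention_in_days = 30
}
```

The right retention value depends on your debugging and compliance needs; the point is that it is set deliberately rather than left at the never-expire default.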

Network is the most counterintuitive dimension because it operates asymmetrically. Moving data into AWS is almost always free. Moving data out (to the internet, to another region, or between components in different availability zones) costs money. Traffic between services within the same AZ using private IPs is free. This asymmetry is intentional. AWS wants your data inside its ecosystem and charges you to take it out.

Figure: Network pricing is asymmetric. Ingress is free; cross-AZ traffic and egress carry charges that accumulate with scale.

An application that processes data locally and returns a small result is a fundamentally different cost profile than one that streams large payloads across availability zones or fans out responses to multiple internet consumers. Network costs are often invisible until they’re not: they don’t appear in instance dashboards, they accumulate per-byte across dozens of service interactions, and they’re hard to reason about without deliberate instrumentation.

How to Make Your Costs Transparent and Analyzable

Knowing the dimensions is useful, but not enough. An organization needs to be able to drill into those cost dimensions and attribute them to specific workloads. That’s what lets you evaluate the ROI of what you’re running and move fast when something needs to change.

AWS does not provide this visibility by default. The console will tell you that EC2 cost $14,000 last month. It won’t tell you which team’s workload, which environment (prod vs. staging), or which specific pipeline drove that number. Getting that level of attribution requires a deliberate tagging strategy set up before costs become a problem, not after.

What is the right tagging strategy for AWS resources?

AWS lets you attach arbitrary key-value tags to most resources. A minimal schema that actually works in practice needs to cover at least four dimensions: env (prod, staging, dev), team, project, and managed-by (terraform, manual, etc.).

With these four tags consistently applied you can answer questions like “what does our staging environment actually cost?” or “how much does the ingestion pipeline spend per month?” Without them, those questions are unanswerable from the bill alone.
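One way to apply this schema across a whole Terraform configuration is the AWS provider's default_tags block. The sketch below is one possible setup, not the only one; the region and tag values are assumptions.

```hcl
provider "aws" {
  region = "eu-central-1" # assumption: use your own region

  # Applied to taggable resources created through this provider configuration.
  default_tags {
    tags = {
      env          = "staging"
      team         = "data-platform"
      project      = "ingestion"
      "managed-by" = "terraform"
    }
  }
}
```

Not every resource type honors default_tags, so it is worth spot-checking a plan output before relying on it as the only tagging mechanism.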

Cost allocation tags

Tagging resources is only half the work. You also need to activate those tags as cost allocation tags in the AWS Billing console under “Cost allocation tags.” Once activated, they become filterable dimensions in Cost Explorer and columns in AWS Data Exports.

AWS distinguishes between AWS-generated tags (like aws:createdBy) and user-defined tags. User-defined cost allocation tags can take up to 24 hours to appear on the activation page, and up to a further 24 hours to become active in billing data after you activate them. It is worth activating them early in a project lifecycle, not when you need the data.
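Activation can also be kept in code rather than done by hand in the console. The following is a minimal sketch using the aws_ce_cost_allocation_tag resource; it assumes the configuration runs against the account that manages billing for the organization and that the four tag keys from the schema above are already in use.

```hcl
# Assumption: applied from the billing (management) account.
resource "aws_ce_cost_allocation_tag" "required" {
  for_each = toset(["env", "team", "project", "managed-by"])

  tag_key = each.key
  status  = "Active"
}
```

Because a tag key has to show up in billing data before it can be activated, applying this for a brand-new key may fail until AWS has picked it up; re-running the apply a day later is usually enough.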

Tagging enforcement via SCP and Terraform

Tags only work if they’re applied consistently, and manually applying tags consistently does not happen in practice. Enforcement needs to happen at two levels.

At the organizational level, AWS Service Control Policies (SCPs) can block resource creation when required tags are absent. Note that SCPs require AWS Organizations and do not affect the organization’s management account; they only apply to member accounts, whether attached directly or inherited from the root or an OU. The policy below denies common resource creation actions when the env, team, or project tags are missing:

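{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyWithoutEnvTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "ecs:CreateService"],
      "Resource": [
        "arn:aws:ec2:*:*:instance/*",
        "arn:aws:rds:*:*:db:*",
        "arn:aws:ecs:*:*:service/*"
      ],
      "Condition": {
        "Null": { "aws:RequestTag/env": "true" }
      }
    },
    {
      "Sid": "DenyWithoutTeamTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "ecs:CreateService"],
      "Resource": [
        "arn:aws:ec2:*:*:instance/*",
        "arn:aws:rds:*:*:db:*",
        "arn:aws:ecs:*:*:service/*"
      ],
      "Condition": {
        "Null": { "aws:RequestTag/team": "true" }
      }
    },
    {
      "Sid": "DenyWithoutProjectTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "ecs:CreateService"],
      "Resource": [
        "arn:aws:ec2:*:*:instance/*",
        "arn:aws:rds:*:*:db:*",
        "arn:aws:ecs:*:*:service/*"
      ],
      "Condition": {
        "Null": { "aws:RequestTag/project": "true" }
      }
    }
  ]
}

Each statement independently denies the action when its respective tag is absent. Since an explicit Deny takes precedence over any Allow in AWS IAM, any single missing tag blocks the request. The Resource list is scoped to the resource types being created because ec2:RunInstances is also evaluated against resources such as the subnet, AMI, and security group, which never carry request tags. With Resource set to "*", the Null condition would match those as well and deny every launch, tagged or not.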

To deploy this SCP via Terraform and attach it to an organizational unit:

resource "aws_organizations_policy" "require_tags" {
  name        = "RequireResourceTags"
  description = "Deny resource creation without required tags"
  type        = "SERVICE_CONTROL_POLICY"
 
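  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "DenyWithoutEnvTag"
        Effect = "Deny"
        Action = ["ec2:RunInstances", "rds:CreateDBInstance", "ecs:CreateService"]
        # Scoped to the created resource types: RunInstances is also evaluated
        # against subnets, AMIs, etc., which never carry request tags.
        Resource  = ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*", "arn:aws:ecs:*:*:service/*"]
        Condition = { Null = { "aws:RequestTag/env" = "true" } }
      },
      {
        Sid       = "DenyWithoutTeamTag"
        Effect    = "Deny"
        Action    = ["ec2:RunInstances", "rds:CreateDBInstance", "ecs:CreateService"]
        Resource  = ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*", "arn:aws:ecs:*:*:service/*"]
        Condition = { Null = { "aws:RequestTag/team" = "true" } }
      },
      {
        Sid       = "DenyWithoutProjectTag"
        Effect    = "Deny"
        Action    = ["ec2:RunInstances", "rds:CreateDBInstance", "ecs:CreateService"]
        Resource  = ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*", "arn:aws:ecs:*:*:service/*"]
        Condition = { Null = { "aws:RequestTag/project" = "true" } }
      }
    ]
  })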
}
 
resource "aws_organizations_policy_attachment" "require_tags" {
  policy_id = aws_organizations_policy.require_tags.id
  target_id = var.organizational_unit_id
}

At the infrastructure-as-code level, Terraform modules can enforce tags as required inputs so that any consumer of the module inherits the tag structure automatically:

variable "tags" {
  description = "Required resource tags"
  type = object({
    env     = string
    team    = string
    project = string
  })
 
  validation {
    condition     = contains(["prod", "staging", "dev"], var.tags.env)
    error_message = "env must be one of: prod, staging, dev."
  }
}
 
locals {
  common_tags = merge(var.tags, {
    managed-by = "terraform"
  })
}

Any resource block in the module then passes local.common_tags to its tags argument. When a new engineer spins up a service using this module, the required tags are applied without them having to think about it.
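As a sketch of how that looks from both sides, the block below shows a resource inside the module and a caller of the module. The S3 bucket is only a stand-in for whatever the module manages, and the module path and bucket naming scheme are hypothetical.

```hcl
# --- inside the module, next to the variable and locals shown above ---
resource "aws_s3_bucket" "landing" {
  bucket = "${var.tags.project}-${var.tags.env}-landing" # hypothetical naming scheme
  tags   = local.common_tags
}

# --- in the calling configuration ---
module "ingestion_landing" {
  source = "./modules/landing-bucket" # hypothetical path

  tags = {
    env     = "dev"
    team    = "data-platform"
    project = "ingestion"
  }
}
```

Because tags is declared as an object type, leaving out team or project fails at plan time, and the validation block rejects an env value outside the allowed list.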

Both layers together are necessary. SCPs catch anything provisioned outside Terraform. Module defaults handle the bulk of the day-to-day volume.

How tags make cost attribution more structured

Once tagging is in place and cost allocation tags are activated, you gain the ability to slice your bill by any tag combination. In Cost Explorer you can filter by team=data-platform and env=prod to see exactly what production data infrastructure costs, independently from staging or other teams.

The practical pattern for deeper analysis is: set up AWS Data Exports, have it write to S3, and query it with Athena. The export contains every line item on your bill at resource-level granularity, with all activated tags as columns. From there you can build a dashboard that lets you filter by team, workload, environment, and time range, and have actual numbers attached to actual workloads instead of gut feelings.

AWS Data Exports (formerly Cost and Usage Report / CUR)

Cost Explorer is a reasonable starting point for spot-checking. AWS Data Exports is where you get genuine analytical power.

Data Exports is the current AWS service for exporting billing data to S3, replacing the legacy Cost and Usage Reports (CUR). It writes a flat-file report to S3 on a continuous cadence, containing every line item on your bill at the resource level with all activated cost allocation tags as columns. You query it with Athena, load it into Redshift Spectrum, or connect it to a BI tool. The downstream workflow is the same as with legacy CUR.

Data Exports supports two formats. The legacy CUR format preserves the existing column schema for teams migrating from the old service. The FOCUS 1.0 format (FinOps Open Cost and Usage Specification) is a vendor-neutral open standard that makes it easier to combine AWS billing data with spend from other cloud providers. For new setups, FOCUS is the better default.
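Whichever format you choose, the export needs a destination bucket that the billing services are allowed to write to. The sketch below is modeled on the bucket policy shape AWS documents for Data Exports as I understand it; treat the service principals, actions, and source ARN patterns as assumptions to verify against the current documentation. The bucket name is hypothetical.

```hcl
data "aws_caller_identity" "current" {}

resource "aws_s3_bucket" "cur" {
  bucket = "example-org-cur-exports" # hypothetical; bucket names are globally unique
}

resource "aws_s3_bucket_policy" "cur" {
  bucket = aws_s3_bucket.cur.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "AllowDataExportsDelivery"
      Effect = "Allow"
      Principal = {
        Service = ["bcm-data-exports.amazonaws.com", "billingreports.amazonaws.com"]
      }
      Action   = ["s3:PutObject", "s3:GetBucketPolicy"]
      Resource = [aws_s3_bucket.cur.arn, "${aws_s3_bucket.cur.arn}/*"]
      # Restrict delivery to exports owned by this account.
      Condition = {
        StringLike = {
          "aws:SourceAccount" = data.aws_caller_identity.current.account_id
          "aws:SourceArn" = [
            "arn:aws:cur:us-east-1:${data.aws_caller_identity.current.account_id}:definition/*",
            "arn:aws:bcm-data-exports:us-east-1:${data.aws_caller_identity.current.account_id}:export/*",
          ]
        }
      }
    }]
  })
}
```

Without a policy along these lines, creating the export typically fails because the service cannot verify that it is able to write to the bucket.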

To provision a Data Export via Terraform (requires AWS provider >= 5.40):

resource "aws_bcmdataexports_export" "main" {
  export {
    name = "cur-main"
 
    data_query {
      query_statement = "SELECT * FROM COST_AND_USAGE_REPORT"
      table_configurations = {
        "COST_AND_USAGE_REPORT" = {
          "TIME_GRANULARITY"                      = "HOURLY"
          "INCLUDE_RESOURCES"                     = "TRUE"
          "INCLUDE_MANUAL_DISCOUNT_COMPATIBILITY" = "FALSE"
          "INCLUDE_SPLIT_COST_ALLOCATION_DATA"    = "FALSE"
        }
      }
    }
 
    destination_configurations {
      s3_destination {
        s3_bucket = aws_s3_bucket.cur.bucket
        s3_prefix = "cur"
        s3_region = "us-east-1"
 
        s3_output_configurations {
          compression = "PARQUET"
          format      = "PARQUET"
          output_type = "CUSTOM"
          overwrite   = "OVERWRITE_REPORT"
        }
      }
    }
 
    refresh_cadence {
      frequency = "SYNCHRONOUS"
    }
  }
}

"INCLUDE_RESOURCES" = "TRUE" is the equivalent of the legacy additional_schema_elements = ["RESOURCES"]; without it, line items won’t include resource-level detail and the export is far less useful for workload attribution. The SYNCHRONOUS refresh cadence writes new data as soon as AWS makes it available, which gives you the most current view of spend.

What the report looks like

Once the export lands in S3 and is crawled by Glue, each row in the Parquet file represents one line item. The columns most relevant for cost attribution are:

Column	Example value	What it tells you
`line_item_product_code`	`AmazonEC2`	Which AWS service
`line_item_resource_id`	`i-0abc123def456`	The specific resource
`line_item_usage_type`	`BoxUsage:m5.xlarge`	What was consumed
`line_item_unblended_cost`	`2.34`	Cost in USD for this line item
`line_item_line_item_type`	`Usage`	Charge type (Usage, Tax, Credit, etc.)
`resource_tags_user_team`	`data-platform`	Your `team` tag
`resource_tags_user_env`	`prod`	Your `env` tag
`resource_tags_user_project`	`ingestion`	Your `project` tag
`line_item_usage_start_date`	`2026-05-01 00:00:00`	Hour the usage occurred

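Tag columns follow the pattern resource_tags_user_<tagname>. A tag key of team becomes resource_tags_user_team. Any tag that has not been activated as a cost allocation tag will be absent from the export. Note that these flattened column names, like the year and month partition columns used in the query below, follow the legacy CUR layout. A CUR 2.0 export created with SELECT * delivers tags in a single resource_tags map column (read as resource_tags['user_team'] in Athena) and partitions data by billing period, so adapt the column references to whichever layout your export produces.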
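The Glue crawl mentioned above can be provisioned alongside the export. This is a minimal sketch: the database name, schedule, IAM role, and especially the S3 path are assumptions, so check the path your export actually writes to before using it.

```hcl
resource "aws_glue_catalog_database" "billing" {
  name = "billing" # hypothetical database name
}

resource "aws_glue_crawler" "cur" {
  name          = "cur-main"
  database_name = aws_glue_catalog_database.billing.name
  role          = aws_iam_role.glue_crawler.arn # assumed to exist with read access to the bucket
  schedule      = "cron(0 6 * * ? *)"           # once a day at 06:00 UTC

  s3_target {
    # Assumed layout: <prefix>/<export name>/data/
    path = "s3://${aws_s3_bucket.cur.bucket}/cur/cur-main/data/"
  }
}
```

The crawler derives the table name from the S3 path, so the resulting name may differ from the cur_main used in the query that follows; adjust the FROM clause or set a table prefix accordingly.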

The most useful starting query is total cost by team, environment, project, and service for the current month, excluding tax line items:

SELECT
  resource_tags_user_team    AS team,
  resource_tags_user_env     AS env,
  resource_tags_user_project AS project,
  line_item_product_code     AS service,
  ROUND(SUM(line_item_unblended_cost), 2) AS total_cost_usd
FROM cur_main
WHERE
  line_item_line_item_type != 'Tax'
  AND year  = '2026'
  AND month = '5'
GROUP BY 1, 2, 3, 4
ORDER BY total_cost_usd DESC;

This gives you a ranked breakdown of what each team’s workloads spend per project, service, and environment. From here you can drill into any row by adding AND resource_tags_user_team = 'data-platform' to the WHERE clause, or pivot to a daily trend by adding DATE(line_item_usage_start_date) to the SELECT and GROUP BY. The raw column is hourly, so grouping on it directly would yield one row per hour rather than per day.
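To keep such queries versioned with the rest of the infrastructure, they can be stored as Athena named queries. The sketch below saves a daily cost trend per team; it assumes a Glue database called billing, the same cur_main table and flattened tag columns as the query above, and a timestamp-typed usage start column.

```hcl
resource "aws_athena_named_query" "daily_cost_by_team" {
  name        = "daily-cost-by-team"
  database    = "billing" # assumption: Glue database holding the cur_main table
  description = "Daily unblended cost per team for one month"

  query = <<-SQL
    SELECT
      DATE(line_item_usage_start_date)        AS usage_date,
      resource_tags_user_team                 AS team,
      ROUND(SUM(line_item_unblended_cost), 2) AS total_cost_usd
    FROM cur_main
    WHERE
      line_item_line_item_type != 'Tax'
      AND year  = '2026'
      AND month = '5'
    GROUP BY 1, 2
    ORDER BY 1, 3 DESC
  SQL
}
```

Saved queries like this are a cheap first step before building a full dashboard: anyone with Athena access can run them from the console.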

Strategies for Lowering Costs

With visibility in place, you can start acting. The right levers depend on what you’re running.

Data Engineering Workloads

Compute

The most impactful question for data teams is whether they actually need the tools they have defaulted to. Spark clusters and persistent Redshift clusters are the right answer for petabyte-scale processing, but they are regularly deployed on datasets that would run fine on a single machine.

Before reaching for a distributed compute layer, check whether Polars, DuckDB, or Athena with a dbt adapter covers the use case. These are dramatically cheaper and often faster for the workloads most data teams actually have. For a transformation that runs on 50 GB of data, DuckDB on a single EC2 instance costs a fraction of what an EMR cluster costs, and it finishes in comparable time.

Where a managed container environment is genuinely needed, ECS Fargate with autoscaling is a better default than a persistent cluster. With Fargate you pay for the vCPU and memory allocated to your task for as long as it runs, not for idle capacity on a persistent cluster. A target tracking policy keeps the service right-sized without manual intervention:

resource "aws_ecs_service" "data_processor" {
  name            = "data-processor"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.data_processor.arn
  desired_count   = 1
  launch_type     = "FARGATE"
 
  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.ecs.id]
    assign_public_ip = false
  }
 
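  tags = local.common_tags

  # Application Auto Scaling owns the task count after creation; without this,
  # every apply would reset the service to desired_count = 1.
  lifecycle {
    ignore_changes = [desired_count]
  }
}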
 
resource "aws_appautoscaling_target" "ecs_target" {
  max_capacity       = 10
  min_capacity       = 1
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.data_processor.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}
 
resource "aws_appautoscaling_policy" "scale_on_cpu" {
  name               = "scale-on-cpu"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace
 
  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

The policy adds tasks when average CPU utilization rises above the 70% target and removes them when it falls below, with a floor of one task. Scaling to zero with a CPU-based metric is not viable: when task count is 0, there is no CPU to measure and the policy has no signal to scale back out. True scale-to-zero requires an external trigger such as an SQS queue depth alarm driving a step scaling policy.
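The queue-driven variant can be sketched roughly as follows. This is a simplified illustration under several assumptions: the variable names, thresholds, and cooldowns are made up, the service is a queue consumer, and it replaces the CPU-based target above rather than running next to it.

```hcl
# Sketch only: variable names, thresholds, and cooldowns are assumptions.
variable "cluster_name" { type = string }
variable "service_name" { type = string }
variable "queue_name" { type = string }

resource "aws_appautoscaling_target" "worker" {
  min_capacity       = 0 # allow the service to drop to zero tasks
  max_capacity       = 10
  resource_id        = "service/${var.cluster_name}/${var.service_name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "wake_up" {
  name               = "queue-wake-up"
  policy_type        = "StepScaling"
  resource_id        = aws_appautoscaling_target.worker.resource_id
  scalable_dimension = aws_appautoscaling_target.worker.scalable_dimension
  service_namespace  = aws_appautoscaling_target.worker.service_namespace

  step_scaling_policy_configuration {
    adjustment_type         = "ExactCapacity"
    cooldown                = 60
    metric_aggregation_type = "Maximum"

    step_adjustment {
      metric_interval_lower_bound = 0 # metric at or above the alarm threshold
      scaling_adjustment          = 1 # set the desired count to one task
    }
  }
}

resource "aws_appautoscaling_policy" "sleep" {
  name               = "queue-sleep"
  policy_type        = "StepScaling"
  resource_id        = aws_appautoscaling_target.worker.resource_id
  scalable_dimension = aws_appautoscaling_target.worker.scalable_dimension
  service_namespace  = aws_appautoscaling_target.worker.service_namespace

  step_scaling_policy_configuration {
    adjustment_type         = "ExactCapacity"
    cooldown                = 300
    metric_aggregation_type = "Maximum"

    step_adjustment {
      metric_interval_upper_bound = 0 # metric below the alarm threshold
      scaling_adjustment          = 0 # set the desired count to zero tasks
    }
  }
}

resource "aws_cloudwatch_metric_alarm" "queue_has_messages" {
  alarm_name          = "${var.queue_name}-has-messages"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  dimensions          = { QueueName = var.queue_name }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1
  alarm_actions       = [aws_appautoscaling_policy.wake_up.arn]
}

resource "aws_cloudwatch_metric_alarm" "queue_empty" {
  alarm_name          = "${var.queue_name}-empty"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  dimensions          = { QueueName = var.queue_name }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 5 # require five quiet minutes before scaling in
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  alarm_actions       = [aws_appautoscaling_policy.sleep.arn]
}
```

One caveat with this sketch: the visible-message count does not include messages a task is still processing, so a production setup would likely also look at ApproximateNumberOfMessagesNotVisible before scaling to zero.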

Storage

S3 object lifecycle rules are the primary lever for data storage costs. Without them, objects accumulate indefinitely in STANDARD storage regardless of how often they are accessed.

The general model is three tiers: hot (STANDARD for frequently accessed data), warm (STANDARD_IA after 30 days for data that is accessed occasionally), and cold (GLACIER_IR or DEEP_ARCHIVE for archival). A lifecycle rule that handles this automatically:

resource "aws_s3_bucket_lifecycle_configuration" "data_lifecycle" {
  bucket = aws_s3_bucket.data.id
 
  rule {
    id     = "transition-to-cheaper-tiers"
    status = "Enabled"
 
    filter {}
 
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
 
    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }
 
    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
 
    expiration {
      days = 2555 # ~7 years; adjust to your data retention policy
    }
  }
}

The filter {} block with no prefix applies the rule to all objects in the bucket. If you want the rule to apply only to a specific prefix, replace it with filter { prefix = "raw/" }.

Three caveats are worth knowing before applying this rule broadly. First, STANDARD_IA has a minimum billable storage duration of 30 days; objects deleted or transitioned away before that are charged for the full 30 days. Second, STANDARD_IA has a minimum billable object size of 128 KB; objects smaller than that are charged as if they were 128 KB. For buckets containing large volumes of small, short-lived files, STANDARD_IA may increase costs rather than reduce them. GLACIER_IR applies the same 128 KB minimum, so moving small objects straight to GLACIER_IR does not avoid the problem; for that kind of data, leaving it in STANDARD and expiring it on a schedule, or compacting it into larger objects first, is typically the safer choice. Third, the colder tiers carry longer minimum storage durations: 90 days for GLACIER_IR and 180 days for DEEP_ARCHIVE, and objects transitioned or deleted earlier are charged for the full period. The rule above keeps objects in GLACIER_IR for 275 days (day 90 to day 365) and in DEEP_ARCHIVE for roughly six years before expiration, so neither minimum is a concern unless you shorten those values.
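For buckets that mix large and small objects, the size caveat can be handled in the rule itself. The variant below is a sketch, assuming the same aws_s3_bucket.data bucket and a hypothetical tmp/ prefix for short-lived files; since a bucket has a single lifecycle configuration, it would replace the rule above rather than sit beside it.

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "data_lifecycle" {
  bucket = aws_s3_bucket.data.id

  # Only tier objects large enough to benefit from the cheaper classes.
  rule {
    id     = "tier-large-objects-only"
    status = "Enabled"

    filter {
      object_size_greater_than = 131072 # bytes (128 KB); smaller objects stay in STANDARD
    }

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }
  }

  # Short-lived scratch data is expired instead of transitioned.
  rule {
    id     = "expire-temp-files"
    status = "Enabled"

    filter {
      prefix = "tmp/" # hypothetical prefix
    }

    expiration {
      days = 14
    }
  }
}
```

The 14-day expiration and the prefix are placeholders; the design choice is simply that small or short-lived objects are deleted on schedule rather than pushed into classes with minimum size and duration charges.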

Beyond lifecycle rules, the default architecture for data storage should be: keep data in object storage as long as possible and only load it into a data warehouse when you need the additional query structure for serving. Storing everything in Redshift because it’s more convenient is one of the faster ways to accumulate a large bill.

General Infrastructure

Compute

For long-running workloads like Kubernetes node groups, persistent EC2 instances, or EMR clusters that run on a predictable schedule, Reserved Instances and Compute Savings Plans offer significant discounts over on-demand pricing in exchange for a one- or three-year commitment. Standard Reserved Instances reach discounts of up to 72 percent on three-year terms. The commitment sounds risky but is usually warranted for anything running continuously in production.

On the architecture side, keeping data close to where it’s consumed directly reduces egress costs. A compute layer that has to pull large datasets from another region is paying for that cross-region transfer on every run. Keeping source data and processing compute in the same region and the same availability zone where possible avoids a category of cost that is otherwise invisible until it shows up on the bill.

Storage

Managed databases like RDS, Aurora, and Redshift also support reserved pricing (Reserved DB Instances and Reserved Nodes respectively), and the savings are substantial for anything running around the clock. DynamoDB’s reserved capacity works differently: it only applies in provisioned throughput mode and is purchased as blocks of write and read capacity units (WCUs/RCUs) rather than per instance. The mistake teams make is treating RI purchases as a one-time decision. RI portfolios need to be reviewed regularly, at least quarterly, to rightsize as capacity requirements shift. An RI purchased for a Redshift cluster that has since been replaced by Athena is pure wasted spend.

The common thread across all of this is that cost control is an operational discipline, not a one-time cleanup exercise. Start with the right mental model: you pay for what exists. Enforce a tagging strategy from day one. Set up AWS Data Exports and build the tooling to make costs queryable before someone sends an urgent email about the bill. The teams that do this work up front spend far less time on expensive, time-pressured migrations later.

If you take away two things: consider costs from the start of any new project, and standardize your tagging strategy before you have more than a handful of resources to retrofit it onto.

Frequently Asked Questions

Why should AWS cost management be proactive rather than reactive?

Reactive cost management forces expensive, time-pressured migrations. Engineering teams that treat cost as a first-class concern from the start make better architectural decisions and avoid retrofitting infrastructure that was never designed to be cost-efficient. A mandate that arrives after overspending is already too late to prevent the hard work.

How do I identify which AWS workloads are driving costs?

Set up AWS Data Exports with resource-level detail enabled and write it to S3. Query it with Athena, filtering by your cost allocation tags. Without a tagging strategy in place first, workload-level attribution is not possible from the bill alone: Cost Explorer can then only tell you that EC2 cost $14,000, not which team or pipeline drove it.

When should I use Reserved Instances instead of on-demand pricing?

Use Reserved Instances for any workload running continuously in production: persistent EC2 instances, always-on RDS databases, or EMR clusters on a predictable schedule. Standard Reserved Instances can reduce costs by up to 72% on three-year terms. Anything with variable or unpredictable demand is better left on-demand or on Spot Instances.

What is the cheapest way to run data transformation on AWS?

For datasets under a few hundred gigabytes, DuckDB or Polars on a single EC2 instance is significantly cheaper than a Spark or EMR cluster and often faster. For SQL-based transformations over S3 data, Athena with a dbt adapter is another cost-effective option. Reserve distributed compute for genuinely large-scale workloads.

How do I prevent S3 storage costs from accumulating silently?

Set S3 lifecycle rules to transition objects to STANDARD_IA after 30 days, GLACIER_IR after 90 days, and DEEP_ARCHIVE after 365 days. Without lifecycle rules, all objects stay in STANDARD storage indefinitely regardless of access frequency. The cloud does not clean up after you.

What is the difference between Cost Explorer and AWS Data Exports (CUR)?

Cost Explorer is a dashboard for spot-checking and high-level trends. AWS Data Exports (the replacement for the legacy Cost and Usage Report) is a flat-file export of every line item on your bill at the resource level, written to S3. It includes all activated cost allocation tags as columns and is the right tool for workload-level attribution, building BI dashboards, and anomaly detection. Data Exports supports both the legacy CUR column schema and the newer FOCUS 1.0 open standard format.
