ECS Fargate as a Migration Bridge: Running Two Orchestrators at Once

#aws #ecs #eks #fargate

Originally published on graycloudarch.com.

Three months into the EKS buildout, someone asked a reasonable question: do we actually need all of this right now?

The cluster was running. The services were containerized. But the team was also operating cert-manager, an ingress controller, external-secrets-operator, and Karpenter — each with its own version compatibility matrix, each capable of generating its own 2am incident. None of it was directly related to shipping the product.

We made the decision to migrate to ECS Fargate first, with EKS as a future destination if and when the operational capacity caught up. Not a retreat — a deliberate two-step. The container images were already built. The IAM patterns were transferable. The application code hadn't changed. Only the orchestration layer was moving.

This is what that migration looked like, and why running both orchestrators simultaneously during the transition was the right pattern.

Why not skip straight to EKS

The decision framework for ECS vs. EKS is covered in a prior post — if you've already worked through that, skip ahead. The short version relevant here: EKS adds roughly fifteen operational concepts on top of running a service, each capable of failing independently. The bridge pattern is for teams where the orchestration question and the containerization question are both open at the same time. Trying to answer them together multiplies the blast radius.

The ECS → EKS migration later is largely mechanical. Task definitions become Helm charts, task roles become IRSA service account annotations, ALB target group registration becomes ingress controller configuration. The container image — the actual artifact — doesn't change. Build ECS as if you'll migrate it, and you will.
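To make the task-role mapping concrete, here is a minimal sketch of what the EKS side might look like later, using the Terraform Kubernetes provider. The variable name, namespace, and service account name are placeholders rather than anything from our setup, and the role's trust policy for the cluster's OIDC provider is assumed to be defined elsewhere.

```hcl
# Hypothetical input: the IAM role that served as the ECS task role,
# with its trust policy updated for the cluster's OIDC provider (not shown).
variable "api_role_arn" {
  type        = string
  description = "ARN of the IAM role the api pods should assume"
}

resource "kubernetes_service_account" "api" {
  metadata {
    name      = "api"
    namespace = "default"

    annotations = {
      # IRSA: pods that use this service account get credentials for the role
      "eks.amazonaws.com/role-arn" = var.api_role_arn
    }
  }
}
```

The permissions policy attached to the role carries over unchanged; only the way the workload obtains the role differs.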

What the ECS foundation looks like in Terraform

Three modules compose to support any service:

# Shared per cluster
module "ecs_cluster" {
  source = "./modules/ecs-cluster"

  name               = "platform-prod"
  log_retention_days = 30
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]
}

# Per service — IAM task role with least-privilege access
module "api_task_role" {
  source = "./modules/ecs-task-role"

  service_name   = "api"
  environment    = "prod"
  secrets_arns   = [aws_secretsmanager_secret.api_db.arn]
  ecr_account_id = var.shared_services_account_id
}

# Per service — ECS service + ALB registration
module "api_service" {
  source = "./modules/ecs-service"

  cluster_arn      = module.ecs_cluster.arn
  task_role_arn    = module.api_task_role.arn
  image            = "${var.ecr_registry}/api:${var.image_tag}"
  cpu              = 512
  memory           = 1024
  desired_count    = 2
  target_group_arn = aws_alb_target_group.api.arn

  environment_variables = {
    APP_ENV = "production"
  }

  secrets = {
    DB_PASSWORD = aws_secretsmanager_secret.api_db.arn
  }
}

The design constraint that matters most: keep the three modules independent. Don't build a composite "ecs-app" module that wraps all three. Independent modules mean each service can tune its task role and scaling behavior without touching the cluster, and the cluster can be upgraded without touching service configurations.
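As an illustration of that independence, adding a second service could be a matter of two more module calls, with the cluster block left untouched. This is a sketch only: the worker service, its secret, and its target group are hypothetical, and the module inputs are assumed to match the ones used for api above.

```hcl
# Hypothetical second service; module.ecs_cluster is reused as-is.
module "worker_task_role" {
  source = "./modules/ecs-task-role"

  service_name   = "worker"
  environment    = "prod"
  secrets_arns   = [aws_secretsmanager_secret.worker_db.arn] # assumed to exist
  ecr_account_id = var.shared_services_account_id
}

module "worker_service" {
  source = "./modules/ecs-service"

  cluster_arn      = module.ecs_cluster.arn
  task_role_arn    = module.worker_task_role.arn
  image            = "${var.ecr_registry}/worker:${var.image_tag}"
  cpu              = 1024 # sized independently of the api service
  memory           = 2048
  desired_count    = 1
  target_group_arn = aws_alb_target_group.worker.arn # assumed to exist

  environment_variables = {
    APP_ENV = "production"
  }

  secrets = {
    DB_PASSWORD = aws_secretsmanager_secret.worker_db.arn
  }
}
```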

Cross-account ECR: the gotcha that hits every team

ECR lives in a shared-services account. ECS runs in the workloads account. This is standard multi-account architecture — and it means the ECS task execution role needs cross-account pull permissions that are easy to get wrong.

Two pieces are required:

# In the workloads account: task execution role policy
data "aws_iam_policy_document" "ecr_cross_account" {
  statement {
    actions = [
      "ecr:GetAuthorizationToken",
    ]
    resources = ["*"]  # GetAuthorizationToken is global; can't be scoped to a registry
  }

  statement {
    actions = [
      "ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer",
      "ecr:BatchGetImage",
    ]
    resources = [
      "arn:aws:ecr:us-east-1:${var.shared_services_account_id}:repository/*"
    ]
  }
}

In the shared-services account, the ECR repository policy:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::WORKLOADS_ACCOUNT_ID:root"
    },
    "Action": [
      "ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer",
      "ecr:BatchGetImage"
    ]
  }]
}

The common failure mode: the task execution role has the right IAM policy, but the ECR repository policy in the shared-services account doesn't grant the workloads account access. A cross-account pull needs both sides to allow the request, so the task fails to pull its image, and the error message ("no basic auth credentials") does nothing to point at the repository policy as the cause.
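If the repositories are also managed in Terraform, the repository policy above can live next to the repository itself so it is harder to forget. A minimal sketch, assuming a repository named api and a workloads_account_id variable (both placeholders):

```hcl
# Applied in the shared-services account.
variable "workloads_account_id" {
  type        = string
  description = "Account ID of the workloads account that runs ECS"
}

# Hypothetical repository; substitute the one your service pulls from.
resource "aws_ecr_repository" "api" {
  name = "api"
}

data "aws_iam_policy_document" "api_repo_pull" {
  statement {
    sid    = "AllowWorkloadsAccountPull"
    effect = "Allow"

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${var.workloads_account_id}:root"]
    }

    # Same three pull actions as the JSON policy above
    actions = [
      "ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer",
      "ecr:BatchGetImage",
    ]
  }
}

resource "aws_ecr_repository_policy" "api" {
  repository = aws_ecr_repository.api.name
  policy     = data.aws_iam_policy_document.api_repo_pull.json
}
```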

Logging: what changes from EKS

On EKS, Fluent Bit runs as a DaemonSet — one per node, automatically collecting logs from every container. On ECS Fargate, there is no shared host and no DaemonSet. You configure logging per task definition.

The simplest approach, and the right default for most services, is the awslogs driver:

"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "/ecs/api",
    "awslogs-region": "us-east-1",
    "awslogs-stream-prefix": "ecs",
    "awslogs-create-group": "true"
  }
}

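This sends all stdout/stderr from the container directly to CloudWatch. No sidecar, and no configuration beyond the task definition. The task execution role needs logs:CreateLogStream and logs:PutLogEvents, which the AWS managed AmazonECSTaskExecutionRolePolicy already grants. The awslogs-create-group: true option creates the log group automatically if it doesn't exist, which is useful during initial deployment, but it also requires logs:CreateLogGroup on the execution role, and the managed policy does not include that permission.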
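One caveat worth sketching: awslogs-create-group only works if the task execution role is allowed to call logs:CreateLogGroup, which the AWS managed AmazonECSTaskExecutionRolePolicy does not grant. A minimal way to add it, assuming the execution role's name is passed in as a variable (a hypothetical input) and log groups live under /ecs/ in us-east-1:

```hcl
variable "execution_role_name" {
  type        = string
  description = "Name of the ECS task execution role (hypothetical input)"
}

data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "create_log_group" {
  statement {
    actions = ["logs:CreateLogGroup"]

    # Scoped to log groups under the /ecs/ prefix in this account
    resources = [
      "arn:aws:logs:us-east-1:${data.aws_caller_identity.current.account_id}:log-group:/ecs/*",
    ]
  }
}

resource "aws_iam_role_policy" "create_log_group" {
  name   = "ecs-create-log-group"
  role   = var.execution_role_name
  policy = data.aws_iam_policy_document.create_log_group.json
}
```

The alternative is to declare the log group in Terraform with aws_cloudwatch_log_group and a retention_in_days value, and drop awslogs-create-group from the task definition entirely.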

For services that need to ship logs to multiple destinations or apply structured filtering, FireLens is the right choice: a Fluent Bit or Fluentd container runs as a sidecar in the same task and routes logs where they need to go. The operational overhead is higher, but the routing flexibility is real.
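For comparison with the awslogs snippet, a FireLens setup might look roughly like the following. Treat it as a sketch under assumptions: the variables are hypothetical inputs, the Fluent Bit image reference is the AWS-published one as I understand it, and the output options shown are for Fluent Bit's cloudwatch_logs plugin.

```hcl
variable "image" { type = string }
variable "execution_role_arn" { type = string }
variable "task_role_arn" { type = string }

resource "aws_ecs_task_definition" "api_firelens" {
  family                   = "api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024
  execution_role_arn       = var.execution_role_arn
  # With FireLens the log router writes to the destination itself, so the
  # task role (not the execution role) needs the CloudWatch Logs permissions.
  task_role_arn = var.task_role_arn

  container_definitions = jsonencode([
    {
      # Sidecar that receives and routes the application's logs
      name      = "log-router"
      image     = "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable"
      essential = true
      firelensConfiguration = {
        type = "fluentbit"
      }
    },
    {
      name      = "api"
      image     = var.image
      essential = true
      logConfiguration = {
        logDriver = "awsfirelens"
        options = {
          Name              = "cloudwatch_logs"
          region            = "us-east-1"
          log_group_name    = "/ecs/api"
          log_stream_prefix = "api-"
          auto_create_group = "true"
        }
      }
    }
  ])
}
```

Swapping the options block for a different Fluent Bit output, or supplying a custom Fluent Bit config with several outputs, is where the routing flexibility comes from.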

Verify logging works before cutting traffic: run aws logs tail /ecs/api --follow while a test request hits the new ECS service. If nothing appears, the task execution role is missing CloudWatch Logs write permissions or the log group name doesn't match.

Running both orchestrators during the soak period

We migrated all production services to ECS Fargate, but we kept EKS running throughout a soak period. Not as a fallback — as a confirmed, immediate revert target.

The migration sequence for each service:

1. Deploy the service on ECS Fargate and validate health checks and task stability
2. Cut DNS to the new ALB (see the companion post on zero-downtime DNS cutover)
3. Monitor for 72 hours: error rates, latency p99, ALB healthy host count
4. If metrics are nominal after 72 hours, deprovision from EKS
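The monitoring step is easier to hold to if the signals are alarms rather than dashboards someone has to remember to check. A sketch of two of them in Terraform follows; the variables, alarm names, and thresholds are placeholders for illustration, not the values we used.

```hcl
variable "alb_arn_suffix" { type = string }          # e.g. aws_lb.<name>.arn_suffix
variable "target_group_arn_suffix" { type = string } # e.g. aws_lb_target_group.<name>.arn_suffix
variable "alarm_topic_arn" { type = string }         # SNS topic for notifications

# Fires if fewer than two targets are healthy for three consecutive minutes
resource "aws_cloudwatch_metric_alarm" "soak_healthy_hosts" {
  alarm_name          = "api-ecs-soak-healthy-hosts"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HealthyHostCount"
  statistic           = "Minimum"
  period              = 60
  evaluation_periods  = 3
  comparison_operator = "LessThanThreshold"
  threshold           = 2
  treat_missing_data  = "breaching"
  alarm_actions       = [var.alarm_topic_arn]

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
    TargetGroup  = var.target_group_arn_suffix
  }
}

# Fires if p99 target response time stays above the placeholder threshold
resource "aws_cloudwatch_metric_alarm" "soak_p99_latency" {
  alarm_name          = "api-ecs-soak-p99-latency"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  extended_statistic  = "p99"
  period              = 60
  evaluation_periods  = 5
  comparison_operator = "GreaterThanThreshold"
  threshold           = 1 # seconds; placeholder value
  alarm_actions       = [var.alarm_topic_arn]

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
  }
}
```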

During the soak period, EKS was live and capable of receiving traffic within 60 seconds if the DNS record was reverted. This isn't a hypothetical backup — it was a committed operational state, with the rollback sequence documented and tested before we cut DNS.

The benefit of this pattern is that it changes the calculus on the cutover decision. If rollback requires re-provisioning on EKS from scratch, the team has every incentive to push through problems rather than revert. If rollback is "update one Route53 record and wait 60 seconds," the team can move fast and revert at the first real signal.
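The one-record revert can be sketched as a single alias record whose target is selected by a variable. Everything named here (the variables, the api.example.com hostname) is hypothetical; the point is only that flipping active_orchestrator back to "eks" and applying is the whole rollback.

```hcl
variable "zone_id" { type = string }

variable "active_orchestrator" {
  type    = string
  default = "ecs"

  validation {
    condition     = contains(["ecs", "eks"], var.active_orchestrator)
    error_message = "Set active_orchestrator to \"ecs\" or \"eks\"."
  }
}

# DNS name and hosted zone ID of each ALB, e.g. from aws_lb.<name>
variable "ecs_alb" {
  type = object({ dns_name = string, zone_id = string })
}

variable "eks_alb" {
  type = object({ dns_name = string, zone_id = string })
}

locals {
  active_alb = var.active_orchestrator == "ecs" ? var.ecs_alb : var.eks_alb
}

resource "aws_route53_record" "api" {
  zone_id = var.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = local.active_alb.dns_name
    zone_id                = local.active_alb.zone_id
    evaluate_target_health = true
  }
}
```

Alias records that point at a load balancer use a 60-second TTL set by Route 53, which lines up with the revert window described above.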

We didn't need to revert. But having the option meant we could make the migration decision cleanly.

The ECS Anywhere variation: running both indefinitely

For one service — a high-volume content delivery workload — the migration pattern extended beyond a time-limited soak period. That service runs on both ECS Fargate and ECS Anywhere simultaneously, with the ability to shift traffic between them at any time.

ECS Anywhere extends ECS to on-premises or edge nodes, registered as EXTERNAL capacity providers. The same ECS service, task definitions, and IAM patterns apply — what changes is the capacity provider:

resource "aws_ecs_service" "delivery" {
  name            = "delivery-central"
  cluster         = aws_ecs_cluster.platform.id
  task_definition = aws_ecs_task_definition.delivery.arn
  desired_count   = var.desired_count

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = var.fargate_weight  # adjust to shift traffic
    base              = 0
  }

  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.anywhere.name
    weight            = var.anywhere_weight
    base              = 0
  }
}

Shifting between Fargate and Anywhere is a Terraform variable change — no service restart, no DNS change, no downtime. The service is always running on both; only the task distribution changes.

This pattern works well for workloads that need geographic proximity to edge infrastructure or where data sovereignty makes cloud-only deployment impractical. It also provides a genuine multi-location deployment model without requiring a separate orchestrator.

When to stay on ECS

ECS Fargate is the right long-term answer — not just the bridge — when:

- Service count is small (under roughly 15-20 services) and autoscaling requirements are straightforward target-tracking
- The team's operational capacity doesn't yet support cluster-level operations: node group upgrades, admission controller management, custom scheduler configuration
- Deploys via Terraform or CI/CD pipelines are acceptable and GitOps isn't a hard requirement
- There is no hard requirement for KEDA, HPA with custom metrics, or cluster-level bin-packing
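For reference, "straightforward target-tracking" on ECS amounts to about two resources with Application Auto Scaling. A minimal sketch, with the cluster and service names passed in as hypothetical variables and the capacity bounds and target value chosen only for illustration:

```hcl
variable "cluster_name" { type = string }
variable "service_name" { type = string }

resource "aws_appautoscaling_target" "api" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${var.cluster_name}/${var.service_name}"
  min_capacity       = 2
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "api_cpu" {
  name               = "api-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.api.service_namespace
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension
  resource_id        = aws_appautoscaling_target.api.resource_id

  target_tracking_scaling_policy_configuration {
    # Keep average service CPU near 60 percent by adding or removing tasks
    target_value = 60

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```

If a service's scaling needs fit in a block like this, the autoscaling argument for Kubernetes is weak for that service.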

The ECS vs. EKS decision framework is covered in more detail in an earlier post. The short version: it's an operational capacity question, not a features comparison.

The bridge pattern is valuable precisely because it decouples the containerization decision from the orchestration decision. You can containerize now, on ECS, without betting that the team is ready to operate Kubernetes. When the team is ready — and that readiness is genuinely there, not aspirational — the migration from ECS to EKS is mostly mechanical. The hard work of containerizing the application is already done.

Running a platform migration and figuring out the container orchestration path? This is the kind of decision I work through with teams regularly. Get in touch.