Paulo Victor Leite Lima Gomes

Posted on Jun 17

aws finops agent makes cloud cost a runtime problem

#ai #agents #aws #finops

Cloud cost management has always had a strange emotional profile.

Everyone agrees it matters. Almost nobody wants to do it. The dashboard is there. The reports exist. The recommendations are technically available. The finance team asks reasonable questions. The engineering team says it will look into it after the launch. Then the launch becomes two launches, the old environment is still running, the database class is still too large, and the mystery line item in the bill becomes a recurring calendar invite.

This is why AWS FinOps Agent caught my attention.

The agent is still at the preview stage, so I would not build a religion around it. But the shape is important. AWS describes an agent that can answer cost questions, surface optimization opportunities, investigate anomalies, run recurring FinOps workflows, generate reports, open Jira tickets, and post findings to Slack.

That is not just "chat with your cloud bill."

That is cost management moving from dashboard archaeology to operational workflow.

And the moment an agent can turn a cost recommendation into a ticket, a Slack message, or a recurring investigation, cloud cost stops being only a reporting problem.

It becomes a runtime problem.

dashboards were never enough

The cloud cost dashboard is a classic enterprise compromise.

It gives everyone a place to point. It rarely gives anyone enough momentum to act.

A dashboard can tell you spend went up. It can show that a service is idle, that a cluster is oversized, that a Savings Plan might help, or that storage grew faster than expected. That is useful information. But between "useful information" and "somebody changed production safely" there is a lot of missing work.

Someone has to decide whether the recommendation is valid.

Someone has to know which team owns the resource.

Someone has to understand whether the workload has a weird traffic pattern, a compliance constraint, a migration in progress, or a customer promise hidden in a Slack thread.

Someone has to open the ticket, chase the owner, make the change, verify the result, and explain to finance why the saving is real or why it is not.

That gap is why FinOps is not just analytics. It is operations.

An agent is interesting here because it can sit in the gap. It can connect the cost signal to the workflow where engineering teams actually make decisions.

That is the useful version.

The dangerous version is the same agent confidently generating noise at scale.

recommendations are not decisions

Cloud optimization tools have always had a translation problem.

"This instance is underutilized" is not the same as "resize it now."

"This database looks idle" is not the same as "delete it."

"A Savings Plan might reduce spend" is not the same as "commit the company to this usage shape."

The recommendation is a clue. The decision needs context.

AWS says FinOps Agent can pull from Cost Optimization Hub and Compute Optimizer, generate reports, surface rightsizing, idle-resource, and Savings Plans recommendations, and create Jira tickets from those recommendations. That is exactly where the line matters.

Opening a ticket is fine. It can be useful. The ticket should contain the evidence, expected saving, owner, affected resources, confidence level, and the reason the agent believes the recommendation is safe enough to review.
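A minimal sketch of that ticket payload, assuming a hypothetical `CostTicket` structure. None of these field names come from AWS or Jira; they only mirror the list in the paragraph above. The check refuses to call a ticket complete if the evidence, owner, or rationale is missing.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CostTicket:
    """Hypothetical ticket payload. Illustrative only, not an AWS or Jira schema."""
    summary: str
    owner: str
    expected_monthly_saving_usd: float
    confidence: str  # e.g. "low", "medium", "high"
    affected_resources: List[str] = field(default_factory=list)
    evidence_links: List[str] = field(default_factory=list)
    safety_rationale: str = ""


def missing_fields(ticket: CostTicket) -> List[str]:
    """Return the names of required fields that are empty or non-positive."""
    problems = []
    if not ticket.owner.strip():
        problems.append("owner")
    if not ticket.affected_resources:
        problems.append("affected_resources")
    if not ticket.evidence_links:
        problems.append("evidence_links")
    if not ticket.safety_rationale.strip():
        problems.append("safety_rationale")
    if ticket.expected_monthly_saving_usd <= 0:
        problems.append("expected_monthly_saving_usd")
    return problems
```

A ticket that comes back with a non-empty list is a candidate for the agent to keep working on, not for a human to triage.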

But the ticket is not the decision.

The decision belongs to whoever owns the service, the budget, or the operational risk.

This sounds obvious until teams start measuring the agent by how much work it creates. A FinOps agent that opens a hundred tickets is not automatically successful. It may have simply exported dashboard noise into Jira.

The better metric is boring: how many recommendations turned into safe, verified savings without creating operational incidents or review fatigue?
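One way to make that metric concrete, as a sketch. The `Outcome` record and its fields are my own assumptions about what you would track per recommendation, not anything the agent reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outcome:
    """Hypothetical per-recommendation result, recorded after the owner has responded."""
    accepted: bool
    verified_saving_usd: float  # measured after the change, 0.0 if never verified
    caused_incident: bool


def safe_verified_savings(outcomes: List[Outcome]) -> float:
    """Sum savings only from accepted, verified changes that caused no incident."""
    return sum(
        o.verified_saving_usd
        for o in outcomes
        if o.accepted and not o.caused_incident and o.verified_saving_usd > 0
    )


def acceptance_rate(outcomes: List[Outcome]) -> float:
    """Share of recommendations the owners accepted; 0.0 when there are none."""
    if not outcomes:
        return 0.0
    return sum(1 for o in outcomes if o.accepted) / len(outcomes)
```

Ticket count does not appear anywhere in it, which is the point.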

anomaly investigation is where this gets real

The most interesting part of the AWS description is not reports. It is anomaly investigation.

Cost anomalies are annoying because they are often urgent but ambiguous. Spend moved. Something changed. Maybe it is a product launch. Maybe it is an abuse pattern. Maybe a test environment leaked. Maybe a retry loop started. Maybe a data pipeline reprocessed more than expected. Maybe a team intentionally scaled something and forgot to tell anyone.

The first hour is usually context gathering.

Which account? Which service? Which region? Which tags? Which deployment happened around the same time? Which team owns it? Is the spend still growing? Is there a customer-facing impact? Is this actually abnormal for month-end, batch processing, or a marketing campaign?

That is good agent work if the boundaries are clear.

An agent that can collect evidence, summarize likely causes, and post findings to Slack can save time. The human does not need to start from a blank dashboard. The incident or FinOps channel gets a first pass with links, affected resources, and next actions.

But the agent needs to show its work.

If it says the root cause is a data pipeline, I want the query trail. If it says a deployment correlates with the spike, I want the deployment link. If it says the spend is limited to one account and one region, I want the exact filters. If it recommends pausing something, I want a human approval gate before action.

For cost anomalies, confidence without evidence is just a faster way to be wrong.
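The evidence and approval rules above can be sketched as a small gate. The `Finding` type and the returned strings are hypothetical; a real integration would route to Slack, a ticket system, or an execution step instead of returning text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Finding:
    """Hypothetical anomaly finding produced by the agent."""
    claim: str
    evidence: List[str] = field(default_factory=list)  # query trails, deployment links, filters
    proposed_action: Optional[str] = None              # e.g. "pause pipeline X"
    approved_by: Optional[str] = None                  # human approver, if any


def next_step(finding: Finding) -> str:
    """Decide what may happen to a finding. Nothing executes without evidence and approval."""
    if not finding.evidence:
        return "reject: no evidence attached"
    if finding.proposed_action is None:
        return "post: informational finding"
    if finding.approved_by is None:
        return "hold: awaiting human approval"
    return "execute: approved by " + finding.approved_by
```

The order of the checks matters: a finding with no evidence never reaches the approval question at all.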

slack is not an accountability model

Posting findings to Slack is useful.

It is also not enough.

A Slack message is a notification. It is not ownership. It is not state. It is not proof that the work was done. It is not a durable record of why a cost decision was accepted or rejected.

The serious version of a FinOps agent needs a trail across systems:

the detected anomaly
the data sources used
the affected accounts, services, regions, and tags
the generated recommendation
the ticket or issue created
the owner assigned
the approval or rejection
the actual change
the measured impact after the change

Without that, the organization gets a more talkative cost dashboard.

With that, the agent becomes part of the operating system for cloud cost.
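As a sketch, that trail can be treated as an ordered checklist, where the interesting question is where a given record stops. The stage names below are my own labels for the items in the list, and the record is assumed to be a plain dictionary rather than any particular system's schema.

```python
from typing import Any, Dict, Optional

# Hypothetical stage names, in the same order as the list above.
TRAIL_STAGES = [
    "anomaly",
    "data_sources",
    "scope",            # accounts, services, regions, tags
    "recommendation",
    "ticket",
    "owner",
    "decision",         # approval or rejection
    "change",
    "measured_impact",
]


def first_gap(record: Dict[str, Any]) -> Optional[str]:
    """Return the first stage with no recorded value, or None if the trail is complete."""
    for stage in TRAIL_STAGES:
        if not record.get(stage):
            return stage
    return None
```

Counting records by their first gap is a cheap way to see whether work stalls at ownership, at approval, or at measuring the result.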

This is where engineering and finance need to be careful. The agent should not become a way for finance to spray tickets at engineering. It should also not become a way for engineering to ignore cost because "the agent will tell us."

The agent is coordination infrastructure.

Coordination still needs ownership.

recurring workflows need budgets too

Recurring FinOps workflows sound great.

Every Monday, generate the cost report. Every day, inspect anomalies. Every week, find idle resources. Every month, check commitment coverage. This is the kind of work that benefits from automation because humans are bad at doing repetitive analysis with consistent patience.

But recurring agent work can quietly become another bill.

The agent uses compute. It calls APIs. It may use model tokens. It may query observability systems. It may open tickets that consume human review time. It may create work that looks productive but does not pay for itself.

So the agent itself needs FinOps discipline.

How much does the recurring workflow cost to run? How many useful actions did it produce? How many false positives? How many recommendations were already known? How much verified saving came from the workflow? Which reports are read by humans, and which ones are just ritual?

If the agent is supposed to reduce waste, it should not be exempt from the same question.

Is this work worth what it costs?
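A back-of-the-envelope version of that question, as a sketch. The inputs are assumptions you would have to measure yourself: what the workflow costs to run, how many human hours its tickets consume, and how much saving was actually verified.

```python
def workflow_net_value(
    run_cost_usd: float,
    review_hours: float,
    hourly_rate_usd: float,
    verified_saving_usd: float,
) -> float:
    """Verified saving minus what the workflow cost in compute and human review time."""
    total_cost = run_cost_usd + review_hours * hourly_rate_usd
    return verified_saving_usd - total_cost


def false_positive_rate(total_findings: int, rejected_as_invalid: int) -> float:
    """Share of findings that owners rejected as invalid; 0.0 when there were no findings."""
    if total_findings == 0:
        return 0.0
    return rejected_as_invalid / total_findings
```

For example, a workflow that costs 120 USD to run, consumes 10 review hours at 80 USD per hour, and produces 1,500 USD of verified saving nets 580 USD. The same workflow with no verified saving is 920 USD of ritual.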

what i would build first

If I were introducing a FinOps agent inside a company, I would avoid the heroic version.

I would not start with "optimize all AWS spend."

I would start with one narrow loop:

one business unit
one set of tagged accounts
one class of recommendation
one ticket template
one Slack channel
one human approval path
one measured saving target

Idle non-production resources are a good candidate. Low-risk rightsizing recommendations might be another. Savings Plans recommendations are tempting, but I would treat commitment decisions as a higher-governance workflow because the failure mode is different.
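The narrow loop can be written down as explicit scope, so that anything outside it is dropped before it becomes a ticket. This is only a sketch: `PilotScope`, its fields, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class PilotScope:
    """Hypothetical definition of one narrow FinOps loop."""
    business_unit: str
    account_ids: FrozenSet[str]
    recommendation_class: str       # e.g. "idle_non_production"
    ticket_template: str
    slack_channel: str
    approver: str
    monthly_saving_target_usd: float


def in_scope(scope: PilotScope, account_id: str, recommendation_class: str) -> bool:
    """True only for the one recommendation class in the tagged accounts of the pilot."""
    return (
        account_id in scope.account_ids
        and recommendation_class == scope.recommendation_class
    )
```

A Savings Plans recommendation would simply fail `in_scope` here, which keeps commitment decisions out of the pilot until they have their own governance.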

The first goal should be to prove that the agent can turn a cost signal into an owned, reviewable, measurable action.

Not a pretty report.

An action.

Then I would measure the rejection reasons. That is where the system improves. If owners reject recommendations because tags are wrong, the platform problem is tagging. If they reject them because the agent lacks context about expected traffic, the problem is context. If they ignore them because there are too many tickets, the problem is workflow design.
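A small sketch of that measurement, assuming owners pick a rejection reason from a short fixed list. The reason codes and their mapping to underlying problems are mine, taken from the three cases in the paragraph above.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical reason codes mapped to the underlying problem they point at.
REASON_TO_PROBLEM = {
    "wrong_tags": "tagging",
    "missing_traffic_context": "context",
    "too_many_tickets": "workflow design",
}


def problems_by_count(rejection_reasons: Iterable[str]) -> Dict[str, int]:
    """Count rejections per underlying problem; unknown reasons land in 'uncategorized'."""
    counts = Counter(
        REASON_TO_PROBLEM.get(reason, "uncategorized") for reason in rejection_reasons
    )
    return dict(counts)
```

If "uncategorized" grows, the reason list itself is missing something, and that is worth knowing too.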

The agent will expose the messy parts of your cloud operating model.

That is useful if you are willing to look.

the punchline

AWS FinOps Agent is a good signal for where cloud operations are going.

The dashboard is not disappearing, but the center of gravity is moving. Cost data is becoming something agents can reason over, schedule, summarize, route, and attach to operational workflows.

That can be genuinely useful. Cloud waste is real, anomaly response is tedious, and many cost recommendations die because nobody turns them into owned work.

But the useful version is not "let the agent manage the bill."

The useful version is an accountable workflow: evidence, ownership, approvals, ticket state, measured impact, and a clean boundary between recommendation and decision.

Cloud cost has always been a shared responsibility, which is a polite way of saying it often belongs to everyone and nobody at the same time.

Agents will not fix that by themselves.

They will make the missing ownership visible faster.

That is still progress.

Top comments (1)

Trigops • Jun 19

The framing hits something real: once you give an agent the authority to act on cloud resources, the problems that surface aren't technical — they're operational. Who owns a decision to terminate or resize something? What's the blast radius if the agent misidentifies a production workload? What does the audit trail look like when a cost anomaly gets investigated six months later?

Schedule-based or metric-based triggers are tractable because the logic is static and inspectable. Agentic workflows change that: the decision path isn't a cron job you can read, it's a chain of reasoning steps. That makes approval gates and explainability first-class requirements, not afterthoughts.

The teams I've seen handle this well treat the agent as a recommender first, actor second — it surfaces the action and the reasoning, a human or policy approves it, and then it executes. The audit trail is a byproduct of that flow, not something bolted on later. Getting comfortable with that separation before expanding the agent's autonomy seems like the right order of operations.