Mark Yu

Bad Data Quality Costs More Than a Slow Query

Bad data usually does not explode.

It quietly poisons reports, recommendations, billing, search, and AI features until nobody trusts the system.

That loss of trust is expensive.

What Bad Data Looks Like

Not all bad data is obviously broken.

| Problem | Example |
| --- | --- |
| Missing value | email = null for active users |
| Invalid format | phone number stored three different ways |
| Duplicate entity | same customer created twice |
| Stale value | subscription says active after cancellation |
| Conflicting source | CRM and billing disagree |
| Schema drift | event payload changes without warning |

The worst bugs are the ones that still produce a dashboard.
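Duplicates are a good example of bad data that looks fine at a glance. Here is a minimal sketch of catching them in application code, assuming a hypothetical `CustomerRecord` shape and assuming that trimming and lowercasing is enough to compare emails in your system:

```typescript
// Hypothetical record shape; the field names are assumptions for illustration.
interface CustomerRecord {
  id: string;
  email: string | null;
}

// Normalize for comparison only: trim whitespace and lowercase.
function normalizeEmail(email: string): string {
  return email.trim().toLowerCase();
}

// Group record ids by normalized email and keep only groups with more than one id.
function findDuplicateEmails(records: CustomerRecord[]): Map<string, string[]> {
  const idsByEmail = new Map<string, string[]>();
  for (const record of records) {
    if (record.email === null) continue; // missing values are a separate check
    const key = normalizeEmail(record.email);
    const ids = idsByEmail.get(key) ?? [];
    ids.push(record.id);
    idsByEmail.set(key, ids);
  }

  const duplicates = new Map<string, string[]>();
  idsByEmail.forEach((ids, email) => {
    if (ids.length > 1) duplicates.set(email, ids);
  });
  return duplicates;
}

const sample: CustomerRecord[] = [
  { id: "c1", email: "Ana@Example.com" },
  { id: "c2", email: " ana@example.com" },
  { id: "c3", email: "bo@example.com" },
  { id: "c4", email: null },
];

// One group is reported: "ana@example.com" with ids c1 and c2.
const duplicateEmails = findDuplicateEmails(sample);
```

An exact string comparison would have missed this pair, which is how the same customer ends up created twice.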

Add Validation at the Boundary

Do not let obviously invalid data enter the system.

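import { z } from "zod";

// Shape every signup request must satisfy before it reaches business logic.
const SignupSchema = z.object({
  email: z.string().email(),
  plan: z.enum(["free", "pro", "team"]),
});

// parse() throws a ZodError on invalid input.
// request.body stands for the raw body from your HTTP framework.
const input = SignupSchema.parse(request.body);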

This is cheaper than cleaning the warehouse later.
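The schema above throws when the input is invalid. If you would rather return every problem to the caller at once, the same boundary check can be sketched without a library. `validateSignup`, `ValidationResult`, and the email pattern are assumptions for illustration, and the pattern is deliberately loose rather than a copy of what zod checks:

```typescript
type Plan = "free" | "pro" | "team";

interface SignupInput {
  email: string;
  plan: Plan;
}

// Either a typed value or a list of human-readable errors, never an exception.
type ValidationResult =
  | { ok: true; value: SignupInput }
  | { ok: false; errors: string[] };

const PLANS: readonly string[] = ["free", "pro", "team"];
// Loose check: something@something.tld with no whitespace.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function validateSignup(body: unknown): ValidationResult {
  if (typeof body !== "object" || body === null) {
    return { ok: false, errors: ["body must be an object"] };
  }
  const { email, plan } = body as Record<string, unknown>;
  const errors: string[] = [];

  if (typeof email !== "string" || !EMAIL_PATTERN.test(email)) {
    errors.push("email must be a valid email address");
  }
  if (typeof plan !== "string" || PLANS.indexOf(plan) === -1) {
    errors.push("plan must be one of: free, pro, team");
  }

  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: { email: email as string, plan: plan as Plan } };
}

// A valid body is accepted; an invalid one collects both errors.
const accepted = validateSignup({ email: "ana@example.com", plan: "pro" });
const rejected = validateSignup({ email: "not-an-email", plan: "enterprise" });
```

With zod itself, `SignupSchema.safeParse(request.body)` gives a similar non-throwing result.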

Add Constraints in the Database

App validation is not enough. Scripts, migrations, and other services can write to the database without passing through it.

ALTER TABLE users
ADD CONSTRAINT users_email_unique UNIQUE (email);

ALTER TABLE subscriptions
ADD CONSTRAINT subscriptions_status_check
CHECK (status IN ('trial', 'active', 'canceled'));

The database should reject impossible states.
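Once the database rejects a write, the application has to turn that rejection into something useful. Here is a sketch, assuming PostgreSQL and a driver that exposes the SQLSTATE code as `code` and the constraint name as `constraint` on its errors. `DbErrorLike`, `WriteFailure`, and `classifyWriteFailure` are hypothetical names:

```typescript
// Assumed error shape: a driver error carrying the SQLSTATE in `code`.
interface DbErrorLike {
  code?: string;
  constraint?: string;
}

type WriteFailure =
  | { kind: "duplicate"; constraint: string | null }
  | { kind: "invalid_state"; constraint: string | null }
  | { kind: "missing_value"; constraint: string | null }
  | { kind: "unknown" };

// Map PostgreSQL integrity-violation codes to failures the app can act on.
function classifyWriteFailure(err: DbErrorLike): WriteFailure {
  const constraint = err.constraint ?? null;
  switch (err.code) {
    case "23505": // unique_violation, e.g. users_email_unique
      return { kind: "duplicate", constraint };
    case "23514": // check_violation, e.g. subscriptions_status_check
      return { kind: "invalid_state", constraint };
    case "23502": // not_null_violation
      return { kind: "missing_value", constraint };
    default:
      return { kind: "unknown" };
  }
}

const duplicateSignup = classifyWriteFailure({
  code: "23505",
  constraint: "users_email_unique",
});
```

A "duplicate" result can become a 409 response, while "unknown" should still be logged and treated as a server error.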

Track Data Quality Like Production Health

Useful checks:

SELECT COUNT(*) FROM users WHERE email IS NULL;

SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

SELECT COUNT(*)
FROM subscriptions
WHERE status = 'active'
  AND canceled_at IS NOT NULL;

These are not glamorous, but they catch real problems.
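To treat these queries like health checks, give each one a name and a threshold, then evaluate the counts on a schedule. The sketch below assumes the counts were already fetched by whatever database client or scheduler you use. `QualityCheck`, `evaluateChecks`, and the check names are hypothetical:

```typescript
// A check is a query that counts offending rows, plus the most we tolerate.
interface QualityCheck {
  name: string;
  sql: string;
  maxViolations: number;
}

interface CheckResult {
  name: string;
  violations: number | null; // null when no count was reported
  passed: boolean;
}

const qualityChecks: QualityCheck[] = [
  {
    name: "users_missing_email",
    sql: "SELECT COUNT(*) FROM users WHERE email IS NULL",
    maxViolations: 0,
  },
  {
    name: "active_but_canceled",
    sql: "SELECT COUNT(*) FROM subscriptions WHERE status = 'active' AND canceled_at IS NOT NULL",
    maxViolations: 0,
  },
];

// Compare fetched counts against thresholds. A missing count is a failure,
// so a check that silently stopped running does not look healthy.
function evaluateChecks(
  checks: QualityCheck[],
  counts: Record<string, number | undefined>
): CheckResult[] {
  return checks.map((check) => {
    const violations = counts[check.name];
    if (violations === undefined) {
      return { name: check.name, violations: null, passed: false };
    }
    return {
      name: check.name,
      violations,
      passed: violations <= check.maxViolations,
    };
  });
}

// Example run: three users without an email, no active-but-canceled rows.
const checkResults = evaluateChecks(qualityChecks, {
  users_missing_email: 3,
  active_but_canceled: 0,
});
const failedChecks = checkResults.filter((result) => !result.passed);
```

Failed checks can feed the same alerting you already use for latency and error rates.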

Ownership Matters

Every important field needs an owner.

If nobody owns customer_status, nobody knows whether billing, CRM, support, or the product database is allowed to change it.
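Ownership can be made explicit in code. Here is a small sketch with a hypothetical registry that names one owning system per field. Making billing the owner of customer_status is an assumption for illustration, not a recommendation:

```typescript
type SystemName = "billing" | "crm" | "support" | "product_db";

// Hypothetical registry: exactly one owning system per important field.
const FIELD_OWNERS: Record<string, SystemName | undefined> = {
  customer_status: "billing",
  email: "product_db",
};

// Only the registered owner may write a field.
// Fields with no registered owner are rejected so the gap surfaces early.
function canWrite(field: string, system: SystemName): boolean {
  return FIELD_OWNERS[field] === system;
}

const billingMayWrite = canWrite("customer_status", "billing"); // true
const crmMayWrite = canWrite("customer_status", "crm"); // false
const unownedField = canWrite("loyalty_tier", "crm"); // false: no owner registered
```

Even if nothing enforces it at runtime, a registry like this gives people one place to look up who is allowed to change a field.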

Visual map:

source system -> validation -> database -> event/log -> analytics

If quality breaks at the source, downstream tools only make the wrong answer prettier.

Why This Matters More With AI

AI agents and retrieval systems make bad data more visible.

If your internal docs, tickets, metrics, and customer records are messy, an AI assistant will confidently retrieve messy context.

In 2026, data quality is not just a BI problem. It is also an AI reliability problem.

Final Thought

Poor data quality is engineering debt with a business disguise.

Fix it close to where the data enters the system, give fields clear ownership, and monitor quality like you monitor latency.

What data quality issue created the most confusion in one of your projects?
