Mark Yu

Posted on Oct 16, 2024 • Edited on Jun 16

Bad Data Quality Costs More Than a Slow Query

#data #backend #database #engineering

Bad data usually does not explode.

It quietly poisons reports, recommendations, billing, search, and AI features until nobody trusts the system.

That loss of trust is expensive.

What Bad Data Looks Like

Not all bad data is obviously broken.

Problem	Example
Missing value	`email = null` for active users
Invalid format	phone number stored three different ways
Duplicate entity	same customer created twice
Stale value	subscription says active after cancellation
Conflicting source	CRM and billing disagree
Schema drift	event payload changes without warning

The worst bugs are the ones that still produce a dashboard.
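Schema drift is one of the easier rows in that table to catch early. Here is a minimal sketch that compares an incoming event payload against the keys a consumer expects. The event shape, the key names, and the `detectDrift` helper are all hypothetical, made up for illustration.

```typescript
interface DriftReport {
  missing: string[]; // expected keys absent from the payload
  unexpected: string[]; // payload keys the consumer does not know about
}

// Compares top-level keys only; nested objects and value types are not checked.
function detectDrift(
  expectedKeys: readonly string[],
  payload: Record<string, unknown>,
): DriftReport {
  const expected = new Set(expectedKeys);
  const actualKeys = Object.keys(payload);
  const actual = new Set(actualKeys);
  return {
    missing: expectedKeys.filter((key) => !actual.has(key)),
    unexpected: actualKeys.filter((key) => !expected.has(key)),
  };
}

// Hypothetical event: the producer renamed `plan` to `plan_name`.
const report = detectDrift(["user_id", "plan"], { user_id: 42, plan_name: "pro" });
console.log(report);
```

A non-empty report is a signal to log or alert on, not necessarily a reason to drop the event.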

Add Validation at the Boundary

Do not let obviously invalid data enter the system.

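import { z } from "zod";

// Describe what a valid signup looks like.
const SignupSchema = z.object({
  email: z.string().email(),
  plan: z.enum(["free", "pro", "team"]),
});

// parse() throws a ZodError on invalid input, so bad payloads stop here.
const input = SignupSchema.parse(request.body);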

Rejecting a bad payload at the request boundary is cheaper than cleaning the warehouse later.
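The boundary is also a good place to normalize formats, so the "phone number stored three different ways" problem never reaches storage. The sketch below is dependency-free and assumes North American numbers normalized to `+1XXXXXXXXXX`. The `normalizePhone` helper and that rule are illustrative assumptions, not a general phone parser.

```typescript
type NormalizeResult =
  | { ok: true; value: string }
  | { ok: false; reason: string };

// Assumption: North American numbers only, stored as +1 followed by 10 digits.
function normalizePhone(raw: string): NormalizeResult {
  const digits = raw.replace(/\D/g, ""); // strip spaces, dots, dashes, parentheses
  if (digits.length === 10) {
    return { ok: true, value: `+1${digits}` };
  }
  if (digits.length === 11 && digits.startsWith("1")) {
    return { ok: true, value: `+${digits}` };
  }
  return { ok: false, reason: `cannot normalize a ${digits.length}-digit number` };
}

// Three spellings of the same number collapse to one stored form.
for (const raw of ["(415) 555-0132", "415.555.0132", "+1 415 555 0132"]) {
  console.log(raw, "->", normalizePhone(raw));
}
```

Returning a result instead of throwing lets the caller decide whether to reject the request or ask the user to correct the field.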

Add Constraints in the Database

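App validation is not enough. Scripts, migrations, and other services can write to the same tables without ever running your application code.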

ALTER TABLE users
ADD CONSTRAINT users_email_unique UNIQUE (email);

ALTER TABLE subscriptions
ADD CONSTRAINT subscriptions_status_check
CHECK (status IN ('trial', 'active', 'canceled'));

The database should reject impossible states.
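When the database does reject a write, the application should turn that into a clear outcome rather than a generic 500. A minimal sketch, assuming PostgreSQL and a driver that exposes the SQLSTATE on a `code` property of the error (node-postgres does this). The `classifyDbError` helper and its outcome names are hypothetical.

```typescript
type WriteOutcome = "duplicate" | "invalid_state" | "unknown";

// PostgreSQL SQLSTATE codes: 23505 = unique_violation, 23514 = check_violation.
function classifyDbError(err: unknown): WriteOutcome {
  const code =
    typeof err === "object" && err !== null && "code" in err
      ? String((err as { code: unknown }).code)
      : "";
  if (code === "23505") return "duplicate"; // e.g. users_email_unique
  if (code === "23514") return "invalid_state"; // e.g. subscriptions_status_check
  return "unknown"; // anything else should be rethrown or logged by the caller
}

console.log(classifyDbError({ code: "23505" }));
console.log(classifyDbError(new Error("connection reset")));
```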

Track Data Quality Like Production Health

Useful checks:

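-- Users with no email on record.
SELECT COUNT(*) FROM users WHERE email IS NULL;

-- Emails shared by more than one user (NULLs are covered by the check above).
SELECT email, COUNT(*)
FROM users
WHERE email IS NOT NULL
GROUP BY email
HAVING COUNT(*) > 1;

-- Subscriptions still marked active after cancellation.
SELECT COUNT(*)
FROM subscriptions
WHERE status = 'active'
  AND canceled_at IS NOT NULL;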

These are not glamorous, but they catch real problems.
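To treat these like production health, each check needs a threshold and a pass/fail result you can alert on. The sketch below covers only that evaluation step. Running the SQL and sending the alert are left to whatever client and pager you already use. The check names and `maxAllowed` thresholds are assumptions for illustration.

```typescript
interface QualityCheck {
  name: string;
  maxAllowed: number; // fail when the observed count exceeds this
}

interface CheckResult {
  name: string;
  count: number | null; // null means the check produced no count
  passed: boolean;
}

// Hypothetical thresholds: zero tolerance for each problem.
const checks: QualityCheck[] = [
  { name: "users_missing_email", maxAllowed: 0 },
  { name: "duplicate_emails", maxAllowed: 0 },
  { name: "active_but_canceled", maxAllowed: 0 },
];

function evaluateChecks(
  toEvaluate: readonly QualityCheck[],
  counts: ReadonlyMap<string, number>,
): CheckResult[] {
  return toEvaluate.map((check) => {
    const count = counts.get(check.name);
    if (count === undefined) {
      // A check that did not run must not pass silently.
      return { name: check.name, count: null, passed: false };
    }
    return { name: check.name, count, passed: count <= check.maxAllowed };
  });
}

// Counts as they might come back from the queries above.
const observed = new Map([
  ["users_missing_email", 0],
  ["duplicate_emails", 3],
]);
const failing = evaluateChecks(checks, observed).filter((result) => !result.passed);
console.log(failing.map((result) => result.name));
```

Treating a missing count as a failure is deliberate: a broken check job is itself a data quality incident.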

Ownership Matters

Every important field needs an owner.

If nobody owns customer_status, nobody knows whether billing, CRM, support, or the product database is allowed to change it.
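Ownership can be written down in code, not just in a wiki. A minimal sketch of an ownership registry follows. Assigning `customer_status` to billing is an arbitrary choice for illustration, and `fieldOwners` and `canWrite` are hypothetical names.

```typescript
type SourceSystem = "billing" | "crm" | "support" | "product_db";

// One owner per field. The assignment below is an assumption, not a recommendation.
const fieldOwners = new Map<string, SourceSystem>([
  ["customer_status", "billing"],
]);

// Only the registered owner may write; fields with no owner reject every writer.
function canWrite(field: string, writer: SourceSystem): boolean {
  return fieldOwners.get(field) === writer;
}

console.log(canWrite("customer_status", "billing"));
console.log(canWrite("customer_status", "crm"));
```

Rejecting writes to unowned fields forces the ownership conversation to happen before the first conflicting update, not after.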

Visual map:

source system -> validation -> database -> event/log -> analytics

If quality breaks at the source, downstream tools only make the wrong answer prettier.

Why This Matters More With AI

AI agents and retrieval systems make bad data more visible.

If your internal docs, tickets, metrics, and customer records are messy, an AI assistant will confidently retrieve messy context.

In 2026, data quality is not just a BI problem. It is also an AI reliability problem.

Final Thought

Poor data quality is engineering debt with a business disguise.

Fix it close to where the data enters the system, give fields clear ownership, and monitor quality like you monitor latency.

What data quality issue created the most confusion in one of your projects?
