Bad data usually does not explode.
It quietly poisons reports, recommendations, billing, search, and AI features until nobody trusts the system.
That loss of trust is expensive.
What Bad Data Looks Like
Not all bad data is obviously broken.
| Problem | Example |
|---|---|
| Missing value | email = null for active users |
| Invalid format | phone number stored three different ways |
| Duplicate entity | same customer created twice |
| Stale value | subscription says active after cancellation |
| Conflicting source | CRM and billing disagree |
| Schema drift | event payload changes without warning |
The worst bugs are the ones that still produce a dashboard that looks right.
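To make the "invalid format" row concrete, here is a minimal sketch of collapsing several phone formats into one canonical form. The helper name `normalizePhone` is hypothetical, and the code assumes 10-digit North American numbers only; a real system would likely lean on a dedicated phone-number library.

```typescript
// Hypothetical helper: collapse common phone formats into one canonical form.
// Assumption: 10-digit North American numbers, optionally prefixed with "1".
function normalizePhone(raw: string): string | null {
  // Keep digits only, dropping spaces, dots, dashes, parentheses, and "+".
  const digits = raw.replace(/\D/g, "");
  // Drop a leading country code "1" when the number has 11 digits.
  const national =
    digits.length === 11 && digits.charAt(0) === "1" ? digits.slice(1) : digits;
  // Anything that is not exactly 10 digits is reported as invalid.
  return national.length === 10 ? `+1${national}` : null;
}

// Three spellings of the same number, plus one value that cannot be repaired.
const samples = ["(555) 123-4567", "555.123.4567", "+1 555 123 4567", "12345"];
const normalized = samples.map(normalizePhone);
```

Returning `null` instead of a best guess keeps unrepairable values visible, so they can be counted and fixed at the source.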
Add Validation at the Boundary
Do not let obviously invalid data enter the system.
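```typescript
import { z } from "zod";

// Shape every signup request must satisfy before it reaches business logic.
const SignupSchema = z.object({
  email: z.string().email(),
  plan: z.enum(["free", "pro", "team"]),
});

// parse() throws on invalid input, so bad data stops here.
// `request` is the incoming HTTP request from your web framework.
const input = SignupSchema.parse(request.body);
```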
This is cheaper than cleaning the warehouse later.
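If you would rather report every problem back to the caller than throw on the first one, the same boundary check can be sketched without any library. Everything here (`validateSignup`, `ValidationResult`, the deliberately loose email pattern) is a hypothetical illustration of the idea, not a replacement for a schema library.

```typescript
type Plan = "free" | "pro" | "team";

interface Signup {
  email: string;
  plan: Plan;
}

// Either a validated value or a list of human-readable problems.
type ValidationResult =
  | { ok: true; value: Signup }
  | { ok: false; errors: string[] };

const PLANS: readonly string[] = ["free", "pro", "team"];

// Hypothetical validator: checks an untrusted request body field by field.
function validateSignup(body: unknown): ValidationResult {
  const errors: string[] = [];
  // Treat anything that is not an object as an empty record.
  const record = (typeof body === "object" && body !== null ? body : {}) as Record<string, unknown>;
  const email = record["email"];
  const plan = record["plan"];

  // Loose check only: something@something.something, with no whitespace.
  if (typeof email !== "string" || !/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email)) {
    errors.push("email: must be a valid email address");
  }
  if (typeof plan !== "string" || PLANS.indexOf(plan) === -1) {
    errors.push("plan: must be one of free, pro, team");
  }

  if (errors.length > 0) {
    return { ok: false, errors };
  }
  return { ok: true, value: { email: email as string, plan: plan as Plan } };
}
```

A call such as `validateSignup({ email: "nope", plan: "enterprise" })` returns both errors at once, which is usually friendlier for API clients than one failure per round trip.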
Add Constraints in the Database
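App validation is not enough. Scripts, migrations, and other services can write to the same tables without ever passing through your application code.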
```sql
ALTER TABLE users
  ADD CONSTRAINT users_email_unique UNIQUE (email);

ALTER TABLE subscriptions
  ADD CONSTRAINT subscriptions_status_check
  CHECK (status IN ('trial', 'active', 'canceled'));
```
The database should reject impossible states.
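Once the database rejects a write, the application still has to explain why. The sketch below assumes PostgreSQL, where SQLSTATE `23505` means a unique violation and `23514` a check violation. The error shape (`code`, `constraint`) and the helper name `describeConstraintError` are assumptions for illustration; adapt them to whatever your driver actually exposes.

```typescript
// Minimal shape of a driver error. Field names are an assumption.
interface DbErrorLike {
  code?: string; // SQLSTATE code, e.g. "23505"
  constraint?: string; // name of the violated constraint, if available
}

// Hypothetical helper: turn a constraint violation into an actionable message.
// Returns null for errors that are not constraint violations.
function describeConstraintError(err: DbErrorLike): string | null {
  if (err.code === "23505" && err.constraint === "users_email_unique") {
    return "An account with this email already exists.";
  }
  if (err.code === "23514" && err.constraint === "subscriptions_status_check") {
    return "Subscription status must be trial, active, or canceled.";
  }
  // Known violation type, but a constraint this helper has no message for.
  if (err.code === "23505" || err.code === "23514") {
    return "The database rejected this write because it violates a constraint.";
  }
  return null;
}
```

Keying the messages on constraint names is one reason to name constraints explicitly, as the `ALTER TABLE` statements do.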
Track Data Quality Like Production Health
Useful checks:
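```sql
-- Users with no email on file.
SELECT COUNT(*) FROM users WHERE email IS NULL;

-- Emails that appear on more than one user.
SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Subscriptions marked active despite having a cancellation date.
SELECT COUNT(*)
FROM subscriptions
WHERE status = 'active'
  AND canceled_at IS NOT NULL;
```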
These are not glamorous, but they catch real problems.
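To treat these counts like production health, each one needs a threshold and an alert. A minimal sketch, assuming the query results have already been fetched: the names `QualityCheck`, `failingChecks`, and `formatAlert` are hypothetical, and the counts are made up for illustration.

```typescript
// One row per check: the count a query returned and the most we tolerate.
interface QualityCheck {
  name: string;
  count: number;
  maxAllowed: number;
}

// Return only the checks whose count exceeds its threshold.
function failingChecks(checks: QualityCheck[]): QualityCheck[] {
  return checks.filter((check) => check.count > check.maxAllowed);
}

// Format a failing check as a one-line alert message.
function formatAlert(check: QualityCheck): string {
  return `[data-quality] ${check.name}: ${check.count} rows (max ${check.maxAllowed})`;
}

// Made-up counts for illustration; in practice these come from scheduled queries.
const latestRun: QualityCheck[] = [
  { name: "users_missing_email", count: 0, maxAllowed: 0 },
  { name: "users_duplicate_email", count: 3, maxAllowed: 0 },
  { name: "active_but_canceled", count: 12, maxAllowed: 0 },
];

const alerts = failingChecks(latestRun).map(formatAlert);
```

A per-check `maxAllowed` lets you start with a tolerant threshold on a messy legacy table and tighten it as the backlog is cleaned up.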
Ownership Matters
Every important field needs an owner.
If nobody owns customer_status, nobody knows whether billing, CRM, support, or the product database is allowed to change it.
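Ownership can be written down as data rather than tribal knowledge. The sketch below is one hypothetical way to do it: a registry mapping each field to the single system allowed to write it. Assigning `customer_status` to billing is an assumption made purely for the example.

```typescript
type SystemName = "billing" | "crm" | "support" | "product_db";

// Hypothetical registry: each important field has exactly one writing system.
const FIELD_OWNERS: { [field: string]: SystemName } = {
  customer_status: "billing", // assumed owner, for illustration only
  customer_email: "product_db",
  support_tier: "support",
};

// A system may write a field only if the registry names it as the owner.
// Fields missing from the registry are treated as unowned and not writable.
function canWrite(field: string, system: SystemName): boolean {
  if (!Object.prototype.hasOwnProperty.call(FIELD_OWNERS, field)) {
    return false;
  }
  return FIELD_OWNERS[field] === system;
}
```

Rejecting writes to unregistered fields is a deliberate choice here: it forces someone to claim a field before anything can change it.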
Visual map:
source system -> validation -> database -> event/log -> analytics
If quality breaks at the source, downstream tools only make the wrong answer prettier.
Why This Matters More With AI
AI agents and retrieval systems make bad data more visible.
If your internal docs, tickets, metrics, and customer records are messy, an AI assistant will confidently retrieve messy context.
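One small mitigation is to filter what the assistant is allowed to retrieve. This is a minimal sketch under stated assumptions: the `ContextDoc` shape, the `freshDocs` helper, and the idea of a single age cutoff are all hypothetical, and real retrieval pipelines will need richer signals than a timestamp.

```typescript
// Hypothetical shape of a document that could be handed to an AI assistant.
interface ContextDoc {
  id: string;
  text: string;
  updatedAt: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Keep only documents that have content and were updated within maxAgeDays.
// `now` is passed in explicitly so the function is deterministic and testable.
function freshDocs(docs: ContextDoc[], now: Date, maxAgeDays: number): ContextDoc[] {
  const cutoff = now.getTime() - maxAgeDays * MS_PER_DAY;
  return docs.filter(
    (doc) => doc.text.trim().length > 0 && doc.updatedAt.getTime() >= cutoff
  );
}
```

Filtering does not fix the underlying records, but it stops obviously stale or empty context from being presented with confidence.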
In 2026, data quality is not just a BI problem. It is also an AI reliability problem.
Final Thought
Poor data quality is engineering debt with a business disguise.
Fix it close to where the data enters the system, give fields clear ownership, and monitor quality like you monitor latency.
What data quality issue created the most confusion in one of your projects?