DEV Community

Olawale Afuye


Your Ticket Was Closed. The User Still Couldn't Pay.

Your backend returned 200.

The mobile app showed an error.

The user tapped "Pay" three times.

Three pending charges hit their account. One order was placed. Their balance was short. And your incident log showed zero failures.

Every engineer on the team did their job. Nobody solved the problem.

This is one of the most common ways engineering teams fail: not through incompetence, but through excellent execution of the wrong unit of work. Until you recognise the difference between completing a task and solving a business problem, you will keep shipping systems that work perfectly and experiences that don't.


The Ticket-Thinker vs. The System-Owner

Most engineers early in their careers think in tickets.

Ticket assigned → code written → tests pass → PR merged → ticket closed. Done.

This is fine when you're learning. It's a liability when you're trying to grow.

The engineer who closes tickets is useful. The engineer who asks "what problem does this ticket actually solve, and am I solving it in the right place?" is dangerous in the best way.

Here's the distinction in practice.

The backend engineer builds a payment endpoint. It processes charges correctly, returns the right status codes, has proper error handling. 100% test coverage. Ticket closed.

The mobile engineer builds the payment screen. It calls the endpoint, handles the response, shows confirmation or error. Smooth UI. Ticket closed.

The problem nobody owned: what happens when the network drops after the backend processes the charge but before the mobile app receives the confirmation?

The backend: charge processed. No error.
The mobile: timeout. Shows "Payment failed." User retries.
The user: charged twice.

Both engineers solved their assigned problem correctly. The business problem — charge the user once and confirm it reliably — went unsolved. Because that problem lived in the space between their tickets, and nobody was watching that space.


Real Scenario 1: The Payment That Worked and Failed at the Same Time

This happens in production more than any team admits.

In a payment flow, the sequence is: mobile initiates → backend charges → payment processor confirms → backend responds → mobile confirms to user.

Network latency exists at every arrow in that chain.

If the connection between the backend and mobile drops after the payment processor confirms but before the backend responds to the mobile, both the backend log and the payment processor log show success. The mobile app shows "Payment failed. Please try again."

A user who trusts the mobile app retries. Now they're charged twice.

The fix isn't purely a backend fix. It isn't purely a mobile fix. It requires:

Idempotency keys: the mobile app generates one unique key per payment intent (not per tap and not per request) and sends that same key with every retry of that payment. The backend uses the key to guarantee that retrying the same request never creates a duplicate charge, however many times the network drops and the client retries.

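// Mobile: create the idempotency key once per payment intent and reuse it on retries
const storageKey = `pending_payment_key_${orderId}`;
let idempotencyKey = localStorage.getItem(storageKey);
if (!idempotencyKey) {
  // First attempt for this order: generate the key and persist it before the request
  idempotencyKey = `pay_${userId}_${orderId}_${Date.now()}`;
  localStorage.setItem(storageKey, idempotencyKey);
}

// Send the same key with every retry of this specific payment
const response = await fetch('/api/payments', {
  method: 'POST',
  headers: {
    'Idempotency-Key': idempotencyKey,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ amount, currency, orderId })
});

// Clear the key only once the outcome is known, so a retry after a timeout reuses it
if (response.ok) {
  localStorage.removeItem(storageKey);
}

On the backend, the same key gates the charge: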
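// Backend: check for an existing successful charge with this key
async function processPayment(req) {
  const idempotencyKey = req.headers['idempotency-key'];
  if (!idempotencyKey) {
    throw new Error('Missing Idempotency-Key header');
  }

  const existing = await db.payments.findOne({ idempotencyKey });
  if (existing?.status === 'success') {
    return existing; // Return the same result. Don't charge again.
  }

  // This check-then-charge sequence is not atomic: two concurrent retries can
  // both pass the lookup. Back it with a unique index on idempotencyKey, or
  // pass the key through to the payment processor if it supports one.
  const charge = await paymentProcessor.charge(req.body);
  await db.payments.create({ idempotencyKey, ...charge });
  return charge;
}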
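
To see the two halves working together, here is a minimal in-memory sketch of the dropped-response scenario. Everything in it is a hypothetical stand-in: `createBackend` replaces the real database and payment processor, `createFlakyNetwork` loses the first few responses after the backend has already done the work, and `payWithRetry` plays the mobile client. It is an illustration of the idea under those assumptions, not production code.

```javascript
// In-memory stand-ins for the payments table and the payment processor (hypothetical).
function createBackend() {
  const payments = new Map(); // idempotencyKey -> stored charge result
  let chargeCount = 0;

  async function processPayment(idempotencyKey, body) {
    const existing = payments.get(idempotencyKey);
    if (existing) return existing; // Same key: replay the stored result.

    chargeCount += 1; // Stands in for the real processor call.
    const charge = { status: 'success', chargeId: `ch_${chargeCount}`, amount: body.amount };
    payments.set(idempotencyKey, charge);
    return charge;
  }

  return { processPayment, getChargeCount: () => chargeCount };
}

// A "network" that delivers every request but loses the first N responses.
function createFlakyNetwork(backend, responsesToDrop) {
  let dropped = 0;
  return async function send(idempotencyKey, body) {
    const result = await backend.processPayment(idempotencyKey, body); // backend did the work
    if (dropped < responsesToDrop) {
      dropped += 1;
      throw new Error('timeout'); // ...but the client never hears about it
    }
    return result;
  };
}

// Client: one key per payment intent, reused on every retry.
async function payWithRetry(send, idempotencyKey, body, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      return await send(idempotencyKey, body);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

const backend = createBackend();
const send = createFlakyNetwork(backend, 2);
payWithRetry(send, 'pay_order-42', { amount: 5000 }).then((result) => {
  console.log(result.status, backend.getChargeCount()); // success 1
});
```

Two responses are lost, the client retries twice, and the user is still charged exactly once, because the retries carry the same key.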

This solution only exists if a backend engineer and a mobile engineer sit down together and ask: what does the user experience look like when the network misbehaves? Not: does my component work?

That's the difference.


Real Scenario 2: The Smart Device That "Works"

A team builds a smart home device. Hardware, mobile app, cloud backend: three separate engineering workstreams.

The hardware engineer ships firmware that correctly sends state changes to the cloud API. Tests pass. Ticket closed.

The mobile engineer ships an app that correctly receives state changes from the cloud and updates the UI. Tests pass. Ticket closed.

The backend engineer ships an API that receives from hardware and sends to mobile. Load tested. Ticket closed.

Users buy the device. They press the button to turn on their light.

The light turns on 11 seconds later.

Nobody's system is broken. The latency was distributed across three components, each one individually fine, each one adding 3–4 seconds of its own processing and polling delay. Nobody measured the end-to-end journey. Nobody owned the number that the user actually experiences: the time between button press and light turning on.

The product reviews say "laggy" and "unresponsive." The engineering team looks at their metrics and sees nothing wrong.

This is what happens when reliability is treated as a component property instead of a system property.

Real reliability, the kind users actually experience, only exists at the intersection of every layer. The backend can be 99.9% available, but if the mobile SDK polls every 5 seconds, a state change can sit for up to 5 seconds before the app even asks the backend about it. Add hardware transmission latency on top of that, then cloud-to-mobile push latency on top of that.
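
The arithmetic is simple enough to write down as a latency budget. In this rough sketch, the 5-second polling interval comes from the example above; the other per-stage figures and the 1-second budget are illustrative assumptions, not measurements from a real device.

```javascript
// Worst-case delay contributed by each hop, in milliseconds.
// The stage figures are illustrative assumptions, not measurements.
const stages = [
  { name: 'mobile polling interval', worstCaseMs: 5000 },
  { name: 'cloud processing and push', worstCaseMs: 3000 },
  { name: 'hardware transmission', worstCaseMs: 3000 },
];

// The user waits for every hop in sequence, so worst cases add up.
function worstCaseJourneyMs(stageList) {
  return stageList.reduce((total, stage) => total + stage.worstCaseMs, 0);
}

// Compare the summed journey against a user-facing budget.
function overBudget(stageList, budgetMs) {
  const total = worstCaseJourneyMs(stageList);
  return { total, budgetMs, exceeded: total > budgetMs };
}

console.log(overBudget(stages, 1000));
// { total: 11000, budgetMs: 1000, exceeded: true }
```

No single stage looks alarming on its own dashboard. The sum is what the user feels.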

The only way to catch this is to instrument the entire journey, not individual components:

// Instrument the user-facing journey end to end
// Not just "did the API respond?" but "did the user get feedback?"

const journeyStart = performance.now();
let outcome = 'confirmed';

try {
  await hardwareCommandAPI.send(deviceId, 'toggle_light');

  // Wait for the device itself to confirm the state change
  // (assumes the wait rejects when the timeout elapses)
  await waitForDeviceStateChange(deviceId, 'on', { timeoutMs: 2000 });
} catch (err) {
  outcome = 'failed_or_timed_out';
  throw err;
} finally {
  // Record every journey, including the slow and failed ones that matter most
  // (the tags argument is an assumption about the metrics client)
  const userFacingLatency = performance.now() - journeyStart;
  metrics.record('light_toggle_user_latency_ms', userFacingLatency, { outcome });
}

When this number starts living in your dashboards, cross-functional conversations change. "The API is fast" stops being the end of the discussion.


Why Engineers Stay Stuck in the Ticket Mindset

It's not laziness. It's incentive structure.

Most engineering teams measure and reward what's visible: tickets closed, PRs merged, features shipped, uptime of individual services.

Nobody measures "how many times did an engineer spot a problem outside their lane and raise it?" Nobody gives performance review credit for the mobile engineer who asked the backend team: "what happens to our payment UI if your charge endpoint takes 8 seconds instead of 200ms?" And then followed up with: "here's what the user sees, here's the drop-off in our funnel."

The ticket system creates invisible walls between components. Each engineer optimizes for their component. The user lives in the space between the walls and has no advocate unless someone consciously takes on that role.

One of the clearest signs of engineering maturity is the ability to think beyond the ticket and own the user outcome.

Not deeper technical expertise in one domain. The willingness to hold the end-to-end user journey in your head while working in one specific layer of it.


What Cross-Functional Reliability Actually Looks Like

Collaboration here doesn't mean more meetings. It means shared ownership of outcomes rather than outputs.

Practically, this looks like:

Defining end-to-end SLOs, not just component SLOs.
Your backend's 99.9% availability means nothing to a user whose mobile app never got the response. Define what the user-facing journey reliability looks like and measure it across every layer together.
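
A back-of-the-envelope sketch of why component SLOs overstate what the user gets. It assumes the journey needs every layer to succeed and that layers fail independently, which is a simplification; the field names `charged` and `userSawConfirmation` are hypothetical.

```javascript
// If a journey needs every layer to succeed and failures are independent,
// the journey's availability is the product of the layers' availabilities.
function journeyAvailability(componentAvailabilities) {
  return componentAvailabilities.reduce((product, a) => product * a, 1);
}

// Measured version: a journey counts as good only if the user saw the outcome.
function measuredJourneySuccessRate(journeys) {
  if (journeys.length === 0) return null; // no data, no claim
  const good = journeys.filter((j) => j.charged && j.userSawConfirmation).length;
  return good / journeys.length;
}

const predicted = journeyAvailability([0.999, 0.999, 0.999]);
console.log(`${(predicted * 100).toFixed(2)}%`); // 99.70%

const sample = [
  { charged: true, userSawConfirmation: true },
  { charged: true, userSawConfirmation: false }, // backend "success", user saw an error
  { charged: true, userSawConfirmation: true },
  { charged: true, userSawConfirmation: true },
];
console.log(measuredJourneySuccessRate(sample)); // 0.75
```

Three layers at 99.9% each already land below 99.9% for the journey, and the measured rate only counts journeys where the user actually saw the confirmation.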

Writing integration tests that simulate the user, not the component.

// Component test (insufficient):
test('payment endpoint returns 200', async () => {
  const res = await request(app).post('/api/payments').send(payload);
  expect(res.status).toBe(200);
});

// Integration test (what actually matters):
test('user can complete payment even on slow network', async () => {
  // Simulate 3G latency
  networkCondition.set({ latency: 500, packetLoss: 0.05 });

  const result = await simulateUserPaymentJourney({
    userId: 'test-user',
    amount: 5000,
    retryOnTimeout: true
  });

  expect(result.charged).toBe(true);
  expect(result.chargeCount).toBe(1); // Exactly once. Not zero. Not two.
  expect(result.userConfirmed).toBe(true);
});

Doing joint failure mode analysis before shipping, not after an incident.
Get backend, mobile, and hardware/infrastructure engineers in the same room with one question: what happens to the user if this part fails? Run through every component. Write down what the user experiences at each failure point. Fix the ones that are unacceptable.

Instrumenting the user journey, not just the service.
Every system already has dashboards showing API response times, error rates, DB query performance. How many have a dashboard showing: user tapped Pay → charge confirmed → confirmation visible to user, with a latency distribution for the whole sequence? Build that one. It will tell you things your component dashboards never will.
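
As a starting point for that dashboard, here is a small sketch that turns per-journey timestamps into a latency distribution. The record fields (`tappedPayAt`, `chargeConfirmedAt`, `confirmationVisibleAt`) and the sample values are hypothetical; a real pipeline would most likely lean on whatever your metrics backend already provides for percentiles.

```javascript
// Each record holds timestamps (ms) for the steps of one payment journey.
// Field names are hypothetical.
function journeyLatencies(records) {
  return records
    // Journeys the user never saw finish have no latency; count them separately.
    .filter((r) => r.confirmationVisibleAt != null)
    .map((r) => r.confirmationVisibleAt - r.tappedPayAt);
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

const records = [
  { tappedPayAt: 0, chargeConfirmedAt: 300, confirmationVisibleAt: 450 },
  { tappedPayAt: 0, chargeConfirmedAt: 280, confirmationVisibleAt: 500 },
  { tappedPayAt: 0, chargeConfirmedAt: 350, confirmationVisibleAt: 8200 }, // slow tail
  { tappedPayAt: 0, chargeConfirmedAt: 310, confirmationVisibleAt: null }, // user never saw it
];

const latencies = journeyLatencies(records);
console.log(percentile(latencies, 50)); // 500
console.log(percentile(latencies, 95)); // 8200
```

In this made-up sample the backend confirmed every charge in about 300ms, yet the p95 the user experienced is over 8 seconds, and one user never saw a confirmation at all.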


The Mental Model Shift

Here's the reframe that changes how you approach your work:

Your job title describes your skill set. It doesn't describe the boundary of your responsibility.

You are a backend engineer who is responsible for users being able to pay reliably. You are a mobile engineer who is responsible for users having confidence in the product. You are a hardware engineer who is responsible for users trusting that the physical interaction works.

The moment you accept that your responsibility extends to the user outcome, not just the technical component, you start asking different questions. You start talking to engineers in other layers. You start caring about what your API response time does to the mobile engineer's loading UX. You start caring about what the mobile engineer's retry logic does to your backend's duplicate detection. You start caring about what the hardware's transmission delay does to the entire chain.

This isn't extra work. This is the actual work.

Closing tickets is a floor, not a ceiling. The engineers who grow fast are the ones who figure that out early. 🔧


What's the most painful cross-stack failure you've shipped or inherited? The ones where every component technically worked and the user still got hurt are the best learning stories. Drop them in the comments.
