Wang - C++ Developer

Posted on Jun 20

C++ Crash Pattern S5 — Race‑Condition Crashes: How to Diagnose and Fix Them

#cpp #legacy #multithreading #programming

Why race‑condition crashes only appear under specific timing, vanish under debugging, and produce different stack traces on every run.

Race‑condition crashes are the first pattern in the series where the failure is not deterministic.

Each thread's code looks correct on its own, the memory is valid, the backtrace may look reasonable, and the crash may even disappear under debugging, but the program still fails unpredictably.

S5 crashes are timing‑dependent failures caused by two or more threads accessing shared state without proper synchronization.

Unlike S4 (wrong‑thread crashes), S5 failures depend on when things happen, not where they happen.

This article explains how to recognize race‑condition crashes, diagnose them efficiently, and fix the underlying concurrency defects.

1. What Is a Race‑Condition Crash?

A race‑condition crash occurs when:

multiple threads access shared state
at least one access is a write
the accesses are not synchronized
the outcome depends on timing

The crash is not tied to a specific line of code.

It is tied to interleavings — the order in which threads execute.

S5 crashes are timing failures: the code is correct in isolation, but incorrect when interleavings change.

2. What Race‑Condition Crashes Look Like

Race‑condition crashes have a distinctive signature:

1. Crash location moves
The crash may appear in different functions on different runs.

2. Crash disappears under debugging
Breakpoints, logging, or sanitizers change timing and hide the bug.

3. Crash frequency depends on load
More threads → more failures.
Single‑threaded mode → no failures.

4. Backtrace may look valid or corrupted
Sometimes clean, sometimes garbage — depends on the interleaving.

5. Reproduction is difficult
We may need stress tests, loops, or special timing to trigger it.

6. The crashing line is rarely the root cause
The defect is almost always upstream, in a missing lock or incorrect ownership rule.

Why S5 Nondeterminism Is Different from S2/S3

Race‑condition crashes are nondeterministic because the failure depends on timing, not on corrupted memory.
This is different from S2 and S3: heap and stack corruption also appear nondeterministic, but for a different reason — the program state is already broken, so the crash location moves as corrupted data flows through the system.

In S5, the program is correct in isolation, but incorrect when two threads interleave in the wrong order: timing is what matters.
The nondeterminism comes from the scheduler, not from memory corruption.

3. Likely Patterns — Root Causes

Race‑condition crashes come from a small set of mechanisms:

1. Unsynchronized read/write access
One thread writes while another reads.

2. Double‑delete or premature delete
One thread destroys an object while another still uses it.

3. Incorrect use of atomics
Atomics make a single operation indivisible and visible to other threads; they do not protect invariants.
We can still race on multi‑field state.

4. Missing or incorrect locking
Lock not taken, taken too late, or taken in the wrong order.

5. Data structures not designed for concurrency
Vectors, maps, lists, and custom objects are not thread‑safe by default.

6. Races inside callbacks
Callbacks fire on different threads and access shared state.

7. Races in lifetime management
Weak pointer promoted too late, shared pointer destroyed too early.
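
To make pattern 3 above concrete, here is a minimal sketch of a two-field invariant. The `Pair` class, its invariant (`first == second`), and the `count_violations` helper are hypothetical names introduced only for illustration. If the two fields were `std::atomic<int>` with no mutex, each increment would be indivisible, yet a reader could still run between the two increments and observe `first != second`. Guarding both fields with one lock closes that window.

```cpp
#include <mutex>
#include <thread>

// Hypothetical state with the invariant: first_ == second_.
class Pair {
public:
    void bump() {
        std::lock_guard<std::mutex> lock(m_);
        ++first_;
        ++second_;   // invariant is restored before the lock is released
    }
    bool consistent() const {
        std::lock_guard<std::mutex> lock(m_);
        return first_ == second_;
    }
private:
    mutable std::mutex m_;   // one lock covers both fields
    int first_ = 0;
    int second_ = 0;
};

// Runs one writer and one reader concurrently and returns how many
// times the reader observed the invariant broken.
int count_violations(int iterations) {
    Pair p;
    int violations = 0;   // written only by the reader thread
    std::thread writer([&] {
        for (int i = 0; i < iterations; ++i) p.bump();
    });
    std::thread reader([&] {
        for (int i = 0; i < iterations; ++i) {
            if (!p.consistent()) ++violations;
        }
    });
    writer.join();
    reader.join();
    return violations;
}
```

With the lock in place, `count_violations` should return 0 for any iteration count, because the reader can never see the state between the two increments.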

4. Diagnostic Techniques

Debugging S5 means debugging timing, not memory.

1. Reproduce under stress
Increase thread count, reduce delays, run loops, or use stress harnesses.

2. Use TSAN (Thread Sanitizer)
TSAN is the single most effective tool for detecting data races.

3. Add temporary logging
But be aware: logging changes timing and may hide the bug.

4. Look for shared state
Any object accessed by multiple threads is suspicious.

5. Check lifetime boundaries

Who owns the object?
Who destroys it?
Is destruction synchronized?

6. Examine invariants
Multi‑field invariants require locks, not atomics.

7. Reproduce with forced scheduling
Pin threads, add artificial delays, or use deterministic schedulers.
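
One way to apply techniques 1 and 7 above is a small stress harness that holds every worker at a start gate and releases them together, so the threads contend from their first instruction. The sketch below is an assumption about how such a harness might look; `run_contended` and `stress_counter` are hypothetical names, and the body under test here is a correctly synchronized counter so that the expected result is known.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: starts `threads` workers, parks them at a gate,
// then releases them at once to maximize contention.
template <typename Fn>
void run_contended(int threads, Fn body) {
    std::atomic<bool> go{false};
    std::vector<std::thread> pool;
    pool.reserve(threads);
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&go, &body, t] {
            while (!go.load(std::memory_order_acquire)) {
                std::this_thread::yield();   // wait at the gate
            }
            body(t);
        });
    }
    go.store(true, std::memory_order_release);   // open the gate
    for (auto& th : pool) th.join();
}

// Example body under test: an atomic counter, which must never lose updates.
int stress_counter(int threads, int per_thread) {
    std::atomic<int> counter{0};
    run_contended(threads, [&](int) {
        for (int i = 0; i < per_thread; ++i) counter.fetch_add(1);
    });
    return counter.load();
}
```

Swapping the body for the code under suspicion, and checking an invariant after the join, turns a rare failure into one that is more likely to show up within a few runs.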

5. Remediation Steps

Fixing S5 means strengthening synchronization and ownership rules.

1. Add locks around shared state
Mutexes, shared_mutex, or custom guards.

2. Use message‑passing instead of shared state
Push work to the owning thread.

3. Strengthen lifetime management
Use shared_ptr/weak_ptr carefully.
Destroy objects only when no thread can access them.

4. Avoid “lock‑free” unless we truly need it
Lock‑free code is extremely hard to get right.

5. Use atomics only for simple state
Atomics do not protect invariants across multiple fields.

6. Make thread ownership explicit
Document which thread owns which object.
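
Step 2 above (message passing) can be sketched as a single owning thread that drains a task queue. This is a minimal illustration under stated assumptions, not a production design: `Owner`, `post`, and `sum_on_owner` are hypothetical names, and an empty task is used as the stop signal. Only the queue is shared and locked; the state the tasks touch belongs to the owning thread alone.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical single-owner worker: any thread may post a task,
// but only the owning thread ever runs one.
class Owner {
public:
    Owner() : thread_([this] { loop(); }) {}
    ~Owner() {
        post(nullptr);    // an empty task is the stop signal
        thread_.join();
    }
    void post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }
private:
    void loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return !tasks_.empty(); });
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            if (!task) return;   // stop signal received
            task();              // runs outside the lock
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::thread thread_;   // declared last so the queue exists before the thread starts
};

// Several threads post increments; only the owner thread touches `total`.
int sum_on_owner(int posters, int per_poster) {
    int total = 0;
    {
        Owner owner;
        std::vector<std::thread> ts;
        for (int p = 0; p < posters; ++p) {
            ts.emplace_back([&] {
                for (int i = 0; i < per_poster; ++i) {
                    owner.post([&total] { ++total; });
                }
            });
        }
        for (auto& t : ts) t.join();
    }   // ~Owner runs the remaining tasks, then joins the owner thread
    return total;
}
```

Because the queue is first-in, first-out and the stop signal is posted last, every increment runs before the owner thread exits, and `total` needs no lock of its own.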

6. Example 1 — Unsynchronized Access to Shared State

A classic race: two threads modify a shared vector.

Code

std::vector<int> data;

void writer() {
    for (int i = 0; i < 1000; ++i) {
        data.push_back(i);
    }
}

void reader() {
    for (int i = 0; i < 1000; ++i) {
        int x = data[i];   // sometimes valid, sometimes crash
    }
}

Symptom

Sometimes works
Sometimes crashes
Sometimes SIGSEGV
Sometimes out‑of‑range
Sometimes corrupted data

Diagnostic Path

1. Reproduce under stress → inconsistent results → timing issue.
2. Identify shared state → data accessed by two threads.
3. Check for synchronization → none; vector is not thread‑safe.
4. Confirm with TSAN (optional) → write/read race on data.

This leads directly to the root cause: unsynchronized access to a non‑thread‑safe container.

Root Cause

std::vector is not thread‑safe.
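A push_back that exceeds the current capacity reallocates the buffer and invalidates every pointer, reference, and iterator into it.
A concurrent reader may index past size() or read memory that has just been freed.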

Fix

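#include <mutex>

std::mutex m;   // guards every access to data

void writer() {
    for (int i = 0; i < 1000; ++i) {
        std::lock_guard<std::mutex> lock(m);
        data.push_back(i);
    }
}

void reader() {
    for (int i = 0; i < 1000; ++i) {
        std::lock_guard<std::mutex> lock(m);
        // The writer may not have reached index i yet, so check the bound.
        if (static_cast<std::size_t>(i) < data.size()) {
            int x = data[i];
            (void)x;
        }
    }
}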
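
Section 5 lists shared_mutex among the locking options. As a hedged variation on the fix, the sketch below wraps the vector in a hypothetical `SharedInts` class (C++17) so that readers take a shared lock and only the writer takes an exclusive one. The class name, `append`, `at`, and `run_demo` are illustrative assumptions, not part of the original example.

```cpp
#include <cstddef>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical wrapper: many readers may hold the lock at once,
// a writer holds it exclusively.
class SharedInts {
public:
    void append(int v) {
        std::unique_lock<std::shared_mutex> lock(m_);   // exclusive
        data_.push_back(v);
    }
    std::optional<int> at(std::size_t i) const {
        std::shared_lock<std::shared_mutex> lock(m_);   // shared
        if (i >= data_.size()) return std::nullopt;     // not written yet
        return data_[i];
    }
private:
    mutable std::shared_mutex m_;
    std::vector<int> data_;
};

// Runs one writer and one reader; returns how many reads saw a wrong value.
int run_demo(int n) {
    SharedInts s;
    int bad = 0;   // written only by the reader thread
    std::thread w([&] {
        for (int i = 0; i < n; ++i) s.append(i);
    });
    std::thread r([&] {
        for (int i = 0; i < n; ++i) {
            auto v = s.at(static_cast<std::size_t>(i));
            if (v && *v != i) ++bad;
        }
    });
    w.join();
    r.join();
    return bad;
}
```

Returning a copy of the element, rather than a reference into the vector, matters here: a reference could dangle as soon as the shared lock is released and the writer reallocates.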

7. Example 2 — Lifetime Race (Use‑After‑Free)

A worker thread uses an object after another thread destroys it.

Code

struct Job {
    void run() { /* ... */ }
};

Job* job = new Job();

void worker() {
    job->run();   // sometimes valid, sometimes UAF
}

void cleanup() {
    delete job;   // races with worker
}

Symptom

Crash location moves
Sometimes SIGSEGV
Sometimes SIGABRT
Sometimes no crash

Diagnostic Path

1. Observe nondeterminism → suggests lifetime race.
2. Check ownership → job shared by worker + cleanup.
3. Check destruction timing → delete job may run while worker is active.
4. Force scheduling → adding sleeps changes behavior → confirms timing race.
5. TSAN (optional) → reports race between delete and run.

Root Cause

Lifetime is not synchronized.
job is destroyed while worker still uses it.

Fix

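Use shared_ptr or explicit synchronization.
Note that the shared_ptr control block is thread‑safe, but the global shared_ptr variable itself is not: copying it while another thread resets it is still a data race, so the copy and the reset both need a lock.

std::mutex job_mutex;   // guards the global pointer, not the Job
std::shared_ptr<Job> job = std::make_shared<Job>();

void worker() {
    std::shared_ptr<Job> j;
    {
        std::lock_guard<std::mutex> lock(job_mutex);
        j = job;    // copy under the lock; the copy keeps the Job alive
    }
    if (j) j->run();
}

void cleanup() {
    std::lock_guard<std::mutex> lock(job_mutex);
    job.reset();    // the Job is destroyed once the last copy goes away
}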
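
The fix mentions explicit synchronization as the alternative to shared_ptr. A minimal sketch of that option, assuming a single worker whose thread we control, is to join the worker before destroying the object. The `runs` counter and the `run_then_cleanup` function are hypothetical additions for illustration.

```cpp
#include <memory>
#include <thread>

// Same shape as the Job above, with a hypothetical counter added
// so the effect of run() can be observed.
struct Job {
    int runs = 0;
    void run() { ++runs; }
};

// Returns how many times run() executed before the Job was destroyed.
int run_then_cleanup() {
    auto job = std::make_unique<Job>();
    std::thread worker([raw = job.get()] { raw->run(); });
    worker.join();          // after this, the worker can no longer touch *job
    int runs = job->runs;   // safe: join() orders the worker's writes before this read
    job.reset();            // destruction happens strictly after the join
    return runs;
}
```

The join gives the destruction a fixed place in the order of events, which is the property the racy version lacked.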

8. Example 3 — Using TSAN to Diagnose a Race Condition

This example shows a real race condition that does not crash reliably, but TSAN catches immediately.
It demonstrates how to use the tool and how to interpret its output.

Code

#include <thread>
#include <iostream>

int counter = 0;

void worker() {
    for (int i = 0; i < 100000; ++i) {
        counter++;   // unsynchronized write
    }
}

int main() {
    std::thread t1(worker);
    std::thread t2(worker);

    t1.join();
    t2.join();

    std::cout << "counter = " << counter << "\n";
}

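This program usually prints something close to 200000, but:

sometimes prints a smaller number (lost updates)
sometimes works perfectly

Because a data race is undefined behavior in C++, a corrupted value or a crash is also permitted, although a plain int counter on mainstream hardware rarely does more than lose updates.

This is classic S5 behavior.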

Symptom

Nondeterministic output
Crash appears only under load
Crash disappears when adding logging
Crash location moves
Debugger hides the bug

This is the signature of a race‑condition crash.

Diagnostic Path

1. Reproduce under stress

Run the program in a loop:

for i in {1..1000}; do ./a.out; done

The printed value varies from run to run.

2. Try TSAN

Compile with Thread Sanitizer:

clang++ -fsanitize=thread -g -O1 main.cpp -o tsan_test

Run it:

./tsan_test

3. TSAN immediately reports the race

TSAN output (simplified):

WARNING: ThreadSanitizer: data race
  Write of size 4 at counter by thread T1
    #0 worker main.cpp:7
  Previous write of size 4 at counter by thread T2
    #0 worker main.cpp:7

  Location is global 'counter' at main.cpp:3

TSAN tells us:

what is racing (counter)
where the race happens (line 7)
which threads are involved (T1 and T2)
what type of access (write/write)

This is the fastest way to diagnose S5.

Root Cause

counter is shared mutable state accessed by multiple threads without synchronization.

Even though int is small, incrementing it is not atomic:

load → add → store

Two threads interleave these steps unpredictably.
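
To see how one increment gets lost, the three steps can be written out for two threads and interleaved by hand. The sketch below is single-threaded on purpose, so the interleaving is fixed; `lost_update_demo` and the locals `t1` and `t2` (standing in for each thread's register) are hypothetical names.

```cpp
// Simulates one unlucky interleaving of `counter++` on two threads.
// Both threads load the old value before either one stores.
int lost_update_demo() {
    int counter = 0;
    int t1 = counter;   // T1: load  (reads 0)
    int t2 = counter;   // T2: load  (reads 0, before T1 has stored)
    t1 = t1 + 1;        // T1: add
    t2 = t2 + 1;        // T2: add
    counter = t1;       // T1: store (counter is now 1)
    counter = t2;       // T2: store (counter is still 1; T1's increment is lost)
    return counter;     // 1, although two increments ran
}
```

Two increments ran but the counter advanced by one, which is why the real program tends to print a value below 200000.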

Fix

Use a mutex:

std::mutex m;
int counter = 0;

void worker() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(m);
        counter++;
    }
}

Or use an atomic:

std::atomic<int> counter{0};

void worker() {
    for (int i = 0; i < 100000; ++i) {
        counter.fetch_add(1, std::memory_order_relaxed);
    }
}

After either fix, TSAN reports no races, and the program prints 200000 on every run.

9. When It’s Not This Pattern

S5 is not the correct pattern when:

Crash is deterministic → S1 or S4
Backtrace is corrupted → S3
Crash location is stable → S1
Crash disappears only under sanitizers → S2
Crash happens on wrong thread → S4

S5 is specifically about timing‑dependent failures.

10. Summary

Race‑condition crashes happen when multiple threads access shared state without proper synchronization.
The defect is always in the code, but whether it surfaces depends on timing.
The crash location moves, disappears under debugging, and reappears under load.

The signature is consistent:

nondeterministic
timing‑dependent
moving crash location
sometimes clean, sometimes corrupted backtrace
disappears under debugging

Each thread's code looks right on its own; what breaks it is an interleaving that the synchronization failed to rule out.

11. Takeaways

S5 is timing‑dependent — the crash depends on interleavings.
Crash location moves — the crashing line is rarely the bug.
Debugging changes timing — hiding the failure.
TSAN is our best friend — use it early.
Locks or message‑passing fix most races.
Lifetime must be synchronized — destruction races are common.

DEV Community
