Why race‑condition crashes only appear under specific timing, vanish under debugging, and produce different stack traces on every run.
Race‑condition crashes are the first pattern in the series where the failure is not deterministic.
Each thread's code looks correct when read on its own, the memory is valid, the backtrace may look reasonable, and the crash may even disappear under debugging, yet the program still fails unpredictably.
S5 crashes are timing‑dependent failures caused by two or more threads accessing shared state without proper synchronization.
Unlike S4 (wrong‑thread crashes), S5 failures depend on when things happen, not where they happen.
This article explains how to recognize race‑condition crashes, diagnose them efficiently, and fix the underlying concurrency defects.
1. What Is a Race‑Condition Crash?
A race‑condition crash occurs when:
- multiple threads access shared state
- at least one access is a write
- the accesses are not synchronized
- the outcome depends on timing
The crash is not tied to a specific line of code.
It is tied to interleavings — the order in which threads execute.
S5 crashes are timing failures: the code is correct in isolation, but incorrect when interleavings change.
2. What Race‑Condition Crashes Look Like
Race‑condition crashes have a distinctive signature:
1. Crash location moves
The crash may appear in different functions on different runs.
2. Crash disappears under debugging
Breakpoints, logging, and sanitizer instrumentation change timing and can hide the crash, even when a tool such as TSAN still reports the underlying race.
3. Crash frequency depends on load
More threads → more failures.
Single‑threaded mode → no failures.
4. Backtrace may look valid or corrupted
Sometimes clean, sometimes garbage — depends on the interleaving.
5. Reproduction is difficult
We may need stress tests, loops, or special timing to trigger it.
6. The crashing line is rarely the root cause
The defect is almost always upstream, in a missing lock or incorrect ownership rule.
Why S5 Nondeterminism Is Different from S2/S3
Race‑condition crashes are nondeterministic because the failure depends on timing, not on corrupted memory.
This is different from S2 and S3: heap and stack corruption also appear nondeterministic, but for a different reason — the program state is already broken, so the crash location moves as corrupted data flows through the system.
In S5, each thread's code is correct in isolation, but the program is incorrect when two threads interleave in the wrong order.
The nondeterminism comes from the scheduler, not from memory corruption.
3. Likely Patterns — Root Causes
Race‑condition crashes come from a small set of mechanisms:
1. Unsynchronized read/write access
One thread writes while another reads.
2. Double‑delete or premature delete
One thread destroys an object while another still uses it.
3. Incorrect use of atomics
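Atomics make a single load, store, or read‑modify‑write indivisible and visible to other threads; they do not protect invariants.
We can still race on multi‑field state, because another thread can run between two atomic operations.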
4. Missing or incorrect locking
Lock not taken, taken too late, or taken in the wrong order.
5. Data structures not designed for concurrency
Vectors, maps, lists, and custom objects are not thread‑safe by default.
6. Races inside callbacks
Callbacks fire on different threads and access shared state.
7. Races in lifetime management
Weak pointer promoted too late, shared pointer destroyed too early.
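To make pattern 3 concrete, here is a minimal sketch of a two‑field invariant (lo <= hi) protected by one mutex. The Range type and the count_violations helper are hypothetical names invented for this illustration. If lo and hi were two separate std::atomic<int> fields instead, a reader could observe lo == 100 and hi == 10 between the two stores, even though each store is atomic.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical type for illustration: the invariant is lo <= hi at all times.
class Range {
public:
    // Both fields change under one lock, so no reader sees a half-updated pair.
    void set(int lo, int hi) {
        assert(lo <= hi);
        std::lock_guard<std::mutex> lock(m_);
        lo_ = lo;
        hi_ = hi;
    }
    // Reads both fields under the same lock and reports whether lo <= hi held.
    bool consistent() const {
        std::lock_guard<std::mutex> lock(m_);
        return lo_ <= hi_;
    }
private:
    mutable std::mutex m_;
    int lo_ = 0;
    int hi_ = 0;
};

// Runs one writer and one checker concurrently and returns how many
// inconsistent snapshots the checker observed.
int count_violations(int iterations) {
    Range r;
    int violations = 0;  // written only by the checker, read after join
    std::thread writer([&] {
        for (int i = 0; i < iterations; ++i) {
            // Alternate between two ranges that do not overlap.
            if (i % 2 == 0) r.set(0, 10); else r.set(100, 200);
        }
    });
    std::thread checker([&] {
        for (int i = 0; i < iterations; ++i) {
            if (!r.consistent()) ++violations;
        }
    });
    writer.join();
    checker.join();
    return violations;
}
```

With the lock in place, count_violations should return 0 for any iteration count, because the checker can never run between the two field writes.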
4. Diagnostic Techniques
Debugging S5 means debugging timing, not memory.
1. Reproduce under stress
Increase thread count, reduce delays, run loops, or use stress harnesses.
2. Use TSAN (Thread Sanitizer)
TSAN is the single most effective tool for detecting data races.
3. Add temporary logging
But be aware: logging changes timing and may hide the bug.
4. Look for shared state
Any object accessed by multiple threads is suspicious.
5. Check lifetime boundaries
- Who owns the object?
- Who destroys it?
- Is destruction synchronized?
6. Examine invariants
Multi‑field invariants require locks, not atomics.
7. Reproduce with forced scheduling
Pin threads, add artificial delays, or use deterministic schedulers.
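Techniques 1 and 7 can be combined in a small in‑process stress harness. The sketch below is one possible shape, not a standard API: run_stress and stress_atomic_counter are hypothetical helpers. All workers spin on a start gate and are released together, which maximizes the overlap between their first accesses to shared state.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: starts `threads` workers, holds them at a start gate,
// then releases them together and runs `body` `iterations` times per worker.
void run_stress(int threads, int iterations, const std::function<void()>& body) {
    assert(threads > 0 && iterations > 0);
    std::atomic<bool> go{false};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            while (!go.load(std::memory_order_acquire)) {
                std::this_thread::yield();  // wait at the gate
            }
            for (int i = 0; i < iterations; ++i) body();
        });
    }
    go.store(true, std::memory_order_release);  // open the gate for everyone
    for (auto& th : pool) th.join();
}

// Usage sketch: a correctly synchronized counter must survive the stress run.
int stress_atomic_counter(int threads, int iterations) {
    std::atomic<int> hits{0};
    run_stress(threads, iterations, [&hits] {
        hits.fetch_add(1, std::memory_order_relaxed);
    });
    return hits.load();
}
```

Passing the suspect code as body and running the harness under TSAN tends to surface races faster than waiting for a crash, though no harness can guarantee that a particular interleaving occurs.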
5. Remediation Steps
Fixing S5 means strengthening synchronization and ownership rules.
1. Add locks around shared state
Mutexes, shared_mutex, or custom guards.
2. Use message‑passing instead of shared state
Push work to the owning thread.
3. Strengthen lifetime management
Use shared_ptr/weak_ptr carefully.
Destroy objects only when no thread can access them.
4. Avoid “lock‑free” unless we truly need it
Lock‑free code is extremely hard to get right.
5. Use atomics only for simple state
Atomics do not protect invariants across multiple fields.
6. Make thread ownership explicit
Document which thread owns which object.
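Remediation 2 (message‑passing) can be sketched as a single owner thread that drains a task queue. OwnerThread and sum_via_owner are hypothetical names for this illustration, and a production version would also need a policy for tasks posted during shutdown. Only the queue is locked; the state touched by the tasks belongs to the owner thread alone.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical single-owner worker: other threads post tasks, and only the
// owner thread runs them.
class OwnerThread {
public:
    OwnerThread() : thread_([this] { loop(); }) {}
    ~OwnerThread() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_one();
        thread_.join();  // the loop drains remaining tasks before exiting
    }
    void post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            queue_.push_back(std::move(task));
        }
        cv_.notify_one();
    }
private:
    void loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
                if (queue_.empty()) return;  // done_ is set and nothing is left
                task = std::move(queue_.front());
                queue_.pop_front();
            }
            task();  // runs outside the lock, on the owner thread only
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> queue_;
    bool done_ = false;
    std::thread thread_;  // declared last so the other members exist first
};

// Two producers post increments; `total` is only ever touched by the owner.
int sum_via_owner(int per_producer) {
    assert(per_producer >= 0);
    int total = 0;
    {
        OwnerThread owner;
        auto produce = [&] {
            for (int i = 0; i < per_producer; ++i) owner.post([&total] { ++total; });
        };
        std::thread a(produce), b(produce);
        a.join();
        b.join();
    }  // the destructor joins the owner thread after the queue is drained
    return total;
}
```

The design choice is that total needs no mutex and no atomic: it has exactly one accessing thread, which also makes remediation 6 (explicit thread ownership) visible in the code.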
6. Example 1 — Unsynchronized Access to Shared State
A classic race: one thread appends to a shared vector while another thread reads from it.
Code
std::vector<int> data;
void writer() {
for (int i = 0; i < 1000; ++i) {
data.push_back(i);
}
}
void reader() {
for (int i = 0; i < 1000; ++i) {
int x = data[i]; // sometimes valid, sometimes crash
}
}
Symptom
- Sometimes works
- Sometimes crashes
- Sometimes SIGSEGV
- Sometimes out‑of‑range
- Sometimes corrupted data
Diagnostic Path
1. Reproduce under stress → inconsistent results → timing issue.
2. Identify shared state → data accessed by two threads.
3. Check for synchronization → none; vector is not thread‑safe.
4. Confirm with TSAN (optional) → write/read race on data.
This leads directly to the root cause: unsynchronized access to a non‑thread‑safe container.
Root Cause
std::vector is not thread‑safe.
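push_back may reallocate the vector's storage, which invalidates the buffer a concurrent reader is indexing into. The reader may also index past the current size, because nothing orders its read after the matching push_back.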
Fix
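std::mutex m;
void writer() {
    for (int i = 0; i < 1000; ++i) {
        std::lock_guard<std::mutex> lock(m);  // held for one push_back
        data.push_back(i);
    }
}
void reader() {
    for (int i = 0; i < 1000; ++i) {
        std::lock_guard<std::mutex> lock(m);  // same mutex as the writer
        if (static_cast<std::size_t>(i) < data.size()) {  // element may not exist yet
            int x = data[i];
            (void)x;  // placeholder for real use of x
        }
    }
}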
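When reads greatly outnumber writes, std::shared_mutex (C++17) lets several readers proceed at once while writers still get exclusive access. The following is a sketch under that assumption; data_mutex, shared_data, append, try_read, and run_reader_writer are hypothetical names, with shared_data playing the role of the vector in this example.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

std::shared_mutex data_mutex;   // hypothetical name
std::vector<int> shared_data;   // plays the role of the shared vector

void append(int value) {
    std::unique_lock<std::shared_mutex> lock(data_mutex);  // exclusive: one writer
    shared_data.push_back(value);
}

// Returns true and fills `out` if `index` exists; returns false otherwise.
bool try_read(std::size_t index, int& out) {
    std::shared_lock<std::shared_mutex> lock(data_mutex);  // shared: many readers
    if (index >= shared_data.size()) return false;
    out = shared_data[index];
    return true;
}

// Runs one writer and one reader concurrently; returns the number of reads
// that succeeded but saw an unexpected value.
int run_reader_writer(int n) {
    assert(n >= 0);
    int mismatches = 0;  // written only by the reader, read after join
    std::thread w([n] { for (int i = 0; i < n; ++i) append(i); });
    std::thread r([n, &mismatches] {
        for (int i = 0; i < n; ++i) {
            int x = 0;
            if (try_read(static_cast<std::size_t>(i), x) && x != i) ++mismatches;
        }
    });
    w.join();
    r.join();
    return mismatches;
}
```

The bounds check and the element read happen under the same lock, so a reader either sees a fully appended element or learns that it does not exist yet.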
7. Example 2 — Lifetime Race (Use‑After‑Free)
A worker thread uses an object after another thread destroys it.
Code
struct Job {
void run() { /* ... */ }
};
Job* job = new Job();
void worker() {
job->run(); // sometimes valid, sometimes UAF
}
void cleanup() {
delete job; // races with worker
}
Symptom
- Crash location moves
- Sometimes SIGSEGV
- Sometimes SIGABRT
- Sometimes no crash
Diagnostic Path
1. Observe nondeterminism → suggests lifetime race.
2. Check ownership → job shared by worker + cleanup.
3. Check destruction timing → delete job may run while worker is active.
4. Force scheduling → adding sleeps changes behavior → confirms timing race.
5. TSAN (optional) → reports race between delete and run.
Root Cause
Lifetime is not synchronized.
job is destroyed while worker still uses it.
Fix
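Use shared_ptr, and synchronize access to the shared pointer object itself. Copying job in one thread while another thread calls reset() on the same shared_ptr is still a data race, so both operations go through a mutex:
std::mutex job_mutex;
std::shared_ptr<Job> job = std::make_shared<Job>();
void worker() {
    std::shared_ptr<Job> j;
    {
        std::lock_guard<std::mutex> lock(job_mutex);
        j = job;  // copy under the lock; j keeps the Job alive
    }
    if (j) j->run();  // runs outside the lock
}
void cleanup() {
    std::lock_guard<std::mutex> lock(job_mutex);
    job.reset();  // drops the global reference; the Job is freed with its last copy
}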
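Another way to express the same ownership rule is to hand the worker its own weak_ptr and promote it at the point of use. This is a minimal sketch: the value returned by run() and the helpers run_after_cleanup and run_racing_cleanup are assumptions made for illustration. Each thread holds its own smart‑pointer object, so no lock is needed around the pointers themselves.

```cpp
#include <cassert>
#include <memory>
#include <thread>

struct Job {
    int run() { return 42; }  // placeholder work; the value is illustrative
};

// The worker receives its own weak_ptr copy by value, so no two threads
// touch the same smart-pointer object.
int worker(std::weak_ptr<Job> weak) {
    if (std::shared_ptr<Job> j = weak.lock()) {  // atomic promotion
        return j->run();  // j keeps the Job alive for the duration of the call
    }
    return -1;  // the Job was already destroyed; skip the work
}

// Cleanup strictly before the worker starts: promotion fails cleanly.
int run_after_cleanup() {
    auto job = std::make_shared<Job>();
    std::weak_ptr<Job> weak = job;
    job.reset();  // the owner drops the last strong reference
    int result = 0;
    std::thread t([&result, weak] { result = worker(weak); });
    t.join();
    return result;
}

// Cleanup racing with the worker: either outcome is safe.
int run_racing_cleanup() {
    auto job = std::make_shared<Job>();
    int result = 0;
    std::thread t([&result, weak = std::weak_ptr<Job>(job)] { result = worker(weak); });
    job.reset();  // may happen before or after the promotion in the worker
    t.join();
    return result;  // 42 or -1 depending on timing, never a use-after-free
}
```

The outcome of run_racing_cleanup is still timing‑dependent, but both outcomes are well defined, which is the difference between a benign ordering question and an S5 crash.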
8. Example 3 — Using TSAN to Diagnose a Race Condition
This example shows a real race condition that does not crash reliably, but TSAN catches immediately.
It demonstrates how to use the tool and how to interpret its output.
Code
#include <thread>
#include <iostream>
int counter = 0;
void worker() {
for (int i = 0; i < 100000; ++i) {
counter++; // unsynchronized write
}
}
int main() {
std::thread t1(worker);
std::thread t2(worker);
t1.join();
t2.join();
std::cout << "counter = " << counter << "\n";
}
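This program usually prints something close to 200000, but the result varies from run to run:
- often it prints a smaller number, because increments are lost
- sometimes it prints exactly 200000
- formally, a data race is undefined behavior in C++, so a corrupted value or a crash is a permitted outcome, even if rare for a plain int
This is classic S5 behavior.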
Symptom
- Nondeterministic output
- Crash appears only under load
- Crash disappears when adding logging
- Crash location moves
- Debugger hides the bug
This is the signature of a race‑condition crash.
Diagnostic Path
1. Reproduce under stress
Running the program in a loop:
for i in {1..1000}; do ./a.out; done
It produces inconsistent results.
2. Try TSAN
Compile with Thread Sanitizer:
clang++ -fsanitize=thread -g -O1 main.cpp -o tsan_test
Run it:
./tsan_test
3. TSAN immediately reports the race
TSAN output (simplified):
WARNING: ThreadSanitizer: data race
Write of size 4 at counter by thread T1
#0 worker main.cpp:7
Previous write of size 4 at counter by thread T2
#0 worker main.cpp:7
Location is global 'counter' at main.cpp:3
TSAN tells us:
- what is racing (counter)
- where the race happens (line 7)
- which threads are involved (T1 and T2)
- what type of access (write/write)
This is the fastest way to diagnose S5.
Root Cause
counter is shared mutable state accessed by multiple threads without synchronization.
Even though int is small, incrementing it is not atomic:
load → add → store
Two threads interleave these steps unpredictably.
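The load → add → store sequence can be replayed by hand to show how an increment gets lost. The sketch below runs on a single thread and simply writes out one unlucky interleaving of two counter++ operations, so the result is deterministic; lost_update_demo and the t1/t2 variables (standing in for each thread's register) are hypothetical names.

```cpp
#include <cassert>

// Replays one unlucky interleaving of two `counter++` operations by hand.
int lost_update_demo() {
    int counter = 0;
    int t1 = counter;  // T1: load  (reads 0)
    int t2 = counter;  // T2: load  (also reads 0, before T1 stores)
    t1 = t1 + 1;       // T1: add
    t2 = t2 + 1;       // T2: add
    counter = t1;      // T1: store (counter == 1)
    counter = t2;      // T2: store (counter == 1, T1's increment is lost)
    assert(counter == 1);
    return counter;    // 1, not 2
}
```

Two increments ran, but the counter only advanced by one. In the real program the scheduler decides how often this interleaving occurs, which is why the printed total varies.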
Fix
Use a mutex:
std::mutex m;
int counter = 0;
void worker() {
for (int i = 0; i < 100000; ++i) {
std::lock_guard<std::mutex> lock(m);
counter++;
}
}
Or use an atomic:
std::atomic<int> counter{0};
void worker() {
for (int i = 0; i < 100000; ++i) {
counter.fetch_add(1, std::memory_order_relaxed);
}
}
After fixing, TSAN reports no races, and the program becomes deterministic: it always prints counter = 200000.
9. When It’s Not This Pattern
S5 is not the correct pattern when:
- Crash is deterministic → S1 or S4
- Backtrace is corrupted → S3
- Crash location is stable → S1
- Crash disappears only under sanitizers → S2
- Crash happens on wrong thread → S4
S5 is specifically about timing‑dependent failures.
10. Summary
- Race‑condition crashes happen when multiple threads access shared state without proper synchronization.
- The failure depends on timing, not on any single line that looks wrong when read in isolation.
- The crash location moves, disappears under debugging, and reappears under load.
The signature is consistent:
- nondeterministic
- timing‑dependent
- moving crash location
- sometimes clean, sometimes corrupted backtrace
- disappears under debugging
The only thing wrong is the interleaving.
11. Takeaways
- S5 is timing‑dependent — the crash depends on interleavings.
- Crash location moves — the crashing line is rarely the bug.
- Debugging changes timing — hiding the failure.
- TSAN is our best friend — use it early.
- Locks or message‑passing fix most races.
- Lifetime must be synchronized — destruction races are common.