The Rust Performance Trap I Hit While Sorting Small Network Datasets
I recently ran into one of those performance problems that looks completely obvious in hindsight, but was surprisingly hard to notice while I was inside the code.
I was working on a deep search and ranking part of a Rust project that processed network-related data.
The system had to evaluate many small groups of candidates repeatedly. You can think of these candidates as possible network paths, intermediate nodes, route-like records, latency candidates, or weighted decisions inside a search tree.
At each step, I had a small list of candidates.
Not thousands of items.
Usually something like 10, 20, maybe 40 or 50.
Each candidate needed to be scored and ordered before the search could continue.
The first version of the code was simple and readable:
let mut candidates = Vec::new();
for item in input {
candidates.push(item);
}
candidates.sort_by_key(|item| compute_score(item));
It worked.
It was clean.
But after profiling it with a flamegraph, I noticed something uncomfortable: the program was spending too much time around temporary allocations.
At first, that sounds strange. A small Vec does not look expensive.
But in a deep search algorithm, a "small allocation" repeated millions of times is no longer small.
That was the beginning of the rabbit hole.
The First Bottleneck: Heap Allocation
The first problem was heap allocation.
Every time the search visited a node, it created temporary vectors to collect, score, and sort candidates. Since the search tree was deep, this happened again and again.
One allocation does not matter much.
Thousands of allocations may still be fine.
But millions of tiny allocations inside a hot path can become a serious bottleneck.
The important thing is this:
Heap allocation is not just "getting some memory".
There is real machinery behind it.
When a program allocates memory on the heap, the allocator has to find a suitable block of memory, update internal metadata, sometimes handle alignment, sometimes split or reuse blocks, and later track that memory so it can be freed safely.
Even if the allocator is very optimized, this is still more expensive than using memory that is already available on the stack.
With stack memory, the cost is usually very small. The program mostly adjusts the stack pointer. It is simple, predictable, and extremely cache-friendly.
Heap memory is more flexible, but that flexibility has a price.
In my case, I did not need flexibility.
I already knew the candidate lists were small. I did not need an unbounded dynamic vector for every step of the search. I needed a tiny temporary buffer.
But I was still paying the cost of heap allocation over and over again.
Why Heap Allocation Can Be Expensive in a Hot Path
Heap allocation becomes especially problematic when it happens inside a tight loop or a deeply repeated algorithm.
There are a few reasons for that.
1. Allocator Overhead
A heap allocator has to manage memory dynamically.
That means it needs bookkeeping.
When I create a Vec, Rust itself is not doing something inefficient. Vec is a great data structure. But when it needs capacity, it asks the allocator for heap memory.
The allocator has to find a free region that fits the requested size. It may reuse an existing block, split a larger block, or request more memory from the system.
Even when this is fast, it is not free.
In normal application code, this cost may not matter. In a hot path, it can dominate the runtime.
2. Memory Locality
Stack memory is usually very close to the current execution context.
Small arrays on the stack tend to be friendly to the CPU cache. The CPU can load nearby memory efficiently, and the access pattern is usually predictable.
Heap memory can be more scattered.
If temporary vectors are allocated and freed repeatedly, the actual memory locations may not be as local as a compact stack buffer.
This matters because modern CPUs are extremely fast, but memory is relatively slow. A lot of real-world performance comes down to whether the CPU can keep useful data in cache.
If the CPU has to wait for memory, the algorithm may look good on paper but still run slower in practice.
3. Fragmentation and Metadata
Heap allocators also maintain metadata.
They need to know which blocks are free, which blocks are used, and how large those blocks are.
Over time, allocation and deallocation patterns can create fragmentation. For small short-lived allocations, modern allocators are usually smart, but they still have to manage the lifecycle of memory.
For a deep search algorithm, this was unnecessary noise.
The lifetime of the data was very short. Each temporary list only lived during one search step.
That is exactly the kind of data that usually fits better on the stack.
4. More Pressure on the CPU Cache
The real issue was not only the cost of allocation itself.
The bigger problem was the combination of allocation, sorting, scoring, and repeated traversal.
The algorithm was doing many tiny operations. Each operation looked harmless, but together they created pressure on the CPU cache.
Temporary heap allocations increased the amount of memory traffic. The CPU had to touch more memory, follow more pointers, and deal with less predictable access patterns.
When code runs once, this is not a big deal.
When code runs inside the hottest part of a search engine, it matters.
The First Optimization: Remove Temporary Heap Allocation
So my first idea was simple:
Stop creating temporary vectors in the hot path.
Instead of collecting candidates into a heap-allocated Vec, I moved toward a fixed-size temporary buffer.
Conceptually, something like this:
const MAX_CANDIDATES: usize = 64;
let mut buffer: [MaybeUninit<ScoredCandidate>; MAX_CANDIDATES] = unsafe {
MaybeUninit::uninit().assume_init()
};
The exact implementation depends on the project, of course. The important part is the idea:
- the number of candidates was bounded
- the temporary data had a very short lifetime
- the buffer could live on the stack
- no heap allocation was needed for every search node
After this change, I expected the program to become faster.
And partially, I was right.
The allocation cost disappeared from the flamegraph.
But then I hit the second trap.
The Surprising Part: Removing Allocations Made It Slower
After removing the temporary heap allocation, I used Rust's standard sorting method:
items.sort_unstable_by_key(|item| compute_score(item));
At first glance, this looked like the right choice.
It is in-place. It avoids extra allocation. It is part of the standard library. It uses a serious optimized sorting algorithm.
But the program became slower.
That was confusing.
I had removed heap allocation. I had improved memory usage. I was using a well-optimized standard method.
So why did performance drop?
The flamegraph gave me the answer.
The allocation cost was gone, but now the program was spending too much time recomputing scores.
The key detail is that sort_unstable_by_key does not cache the key for every item.
The closure can be called multiple times during sorting.
That is completely fine when the key is cheap, like reading an integer field:
items.sort_unstable_by_key(|item| item.priority);
But my key was not cheap.
The score calculation was based on multiple network-related signals, such as:
- estimated latency
- hop penalty
- node priority
- route freshness
- failure probability
- domain-specific weights
So every comparison could trigger real computation again.
And again.
And again.
I had removed heap allocation, but I had accidentally introduced repeated CPU work in the hottest part of the program.
The code looked clean, but it was doing more work than it appeared.
Why the Default Sort Was Not the Best Tool Here
Rust's sort_unstable_by_key is not bad.
Actually, it is very good.
The problem was not Rust.
The problem was that I used a general-purpose sorting method for a very specific workload.
Modern sorting algorithms are excellent for large arrays and general cases. Rust's unstable sort is designed to perform well across many real-world patterns.
But my workload had two special properties:
- The arrays were very small.
- The key calculation was relatively expensive.
For large arrays, an O(N log N) sort is usually the obvious choice.
But for tiny arrays, Big-O can be misleading.
When N is 10, 20, or 40, constant factors matter a lot.
At that size, the overhead of the sorting algorithm, the number of comparisons, branch behavior, cache locality, and repeated score calculation can matter more than the theoretical complexity.
In my case, the theoretical advantage of the sorting algorithm was less important than the practical cost of using it in a tiny hot loop.
The Fix: Cache the Scores and Sort a Small Stack Buffer
The final solution was simple:
- Compute the score for each candidate exactly once.
- Store the candidate and its score in a small stack buffer.
- Sort the scored candidates using insertion sort.
Conceptually:
struct ScoredCandidate<T> {
score: i32,
item: T,
}
Instead of asking the sorting algorithm to repeatedly call compute_score, I made the score part of the temporary data.
The scoring step became linear:
for candidate in candidates {
let score = compute_score(candidate);
scored.push(ScoredCandidate {
score,
item: candidate,
});
}
After that, sorting only compared already-computed integers.
That completely changed the cost profile.
The expensive part happened once per candidate, not many times per comparison.
Why Insertion Sort Worked Better
For the sorting step, I used insertion sort.
Yes, insertion sort.
The algorithm that is usually introduced early in computer science classes and then quickly ignored because it is O(N²).
But for small arrays, insertion sort can be extremely fast.
Why?
Because it is simple.
It has very little overhead. It works well with contiguous memory. It does not need extra allocation. It has predictable behavior. And when the array is small, the quadratic complexity does not have enough room to become a problem.
For example, sorting 20 or 30 items with insertion sort is not scary.
Especially when the data is already in a small stack buffer and comparisons are cheap integer comparisons.
In this case, insertion sort matched the workload better than a more advanced general-purpose sorting algorithm.
This is one of those cases where the "worse" algorithm on paper was better for the real machine.
The Result
After the change, the flamegraph looked completely different.
The repeated scoring calls disappeared from the hot path.
The allocation-related noise was gone.
The middle layers of the search became much faster.
The biggest improvement did not come from one magical trick. It came from aligning the implementation with the real shape of the workload:
- small candidate lists
- deep repeated search
- short-lived temporary data
- expensive scoring
- no need for heap flexibility
- stack-friendly memory layout
- simple sorting over cached scores
The final version was not more complicated in spirit.
It was actually more honest about what the program was doing.
The program did not need a dynamic vector and a general-purpose sort at every search node.
It needed a tiny local ranking buffer.
The Lesson
The main lesson for me was this:
The fastest algorithm in theory is not always the fastest algorithm on real hardware.
Big-O matters, but it is not the whole story.
In performance-sensitive code, especially backend, networking, infrastructure, or systems code, the real questions are often more practical:
- How large is the data really?
- How often does this code run?
- Is this allocation inside a hot path?
- Is the temporary memory short-lived?
- Is the key cheap or expensive?
- Are we recomputing something that could be cached?
- Is the memory layout friendly to the CPU cache?
- Does a general-purpose standard method match this exact workload?
Standard library methods are great defaults. Most of the time, they are exactly what we should use.
But "default" does not mean "always optimal".
In my case, the better solution was:
- avoid repeated heap allocations
- compute scores once
- store small temporary data on the stack
- use insertion sort for tiny arrays
That was enough to beat the more advanced-looking approach.
Final Thought
This experience reminded me that performance work is not only about knowing algorithms.
It is about understanding the shape of the data, the lifetime of memory, and how the CPU actually executes the code.
Sometimes the bottleneck is not where you expect it to be.
Sometimes removing an allocation exposes another hidden cost.
And sometimes the "bad" O(N²) algorithm is exactly what your CPU wanted.
Top comments (0)