Prologue
Memory management is annoying; that’s why we have garbage collectors. But garbage collectors are not free: they consume CPU time and memory. If a program uses a garbage collector, it’s important to make sure that collector is efficient.
The most important part of GC performance?
Many would argue that GC performance is not THAT important, because most GC implementations claim to take only a small fraction of the total execution time, like 10% or even less. That’s true to some extent, but I would argue that GC stop-the-world (STW) pause time is not the most important part of GC performance. The core of GC is always allocation.
In most real-world applications, the allocation rate is really high, so allocation speed can make a huge difference in the performance of the program. This is especially true in languages like C#, where objects are heap-allocated by default, making allocation speed critical.
There are two common ways to reduce the allocation cost:
- Reduce the number of allocations
- Reduce the cost of each allocation
Reducing the number of allocations
Reducing the number of allocations is always a good idea, especially if you use heap allocation as the default. This can be done with an escape analysis pass, which determines whether an object can be stack-allocated; if so, the heap allocation is replaced with a stack allocation.
Reducing the cost of each allocation
Reducing the cost of each allocation is also important. The easiest way to do this is to keep the allocation hot path super simple. I mean really just a few instructions, like bumping a pointer. You can have more complex logic in the slow path, but the hot path, which is executed most of the time, should be as simple as possible.
Here are some tips for keeping the allocation hot path simple:
- Avoid system calls, like `mmap` or `sbrk`, and locks. Note that many languages’ time APIs also make system calls, like `Instant::now` in Rust.
- Avoid atomic operations. Many would think atomic operations are fast enough, but they are not for allocation. An atomic operation may cause cache-line invalidation, which is very expensive on a hot path.
- Use thread-local storage carefully. Thread-local storage is essential for some high-performance allocators, but it may not be a good idea to use it directly in library code. Thread locals have different performance characteristics depending on whether they live in a dynamic library or an executable. Also, different languages implement thread locals differently, and some implementations are much slower than thread locals in C++. A slow implementation may have to make a system call to get the thread id and check whether the thread local is initialized, which is unacceptable on a hot path. So if you want to use thread locals, make sure they are fast enough.
TO BE CONTINUED