https://dl.acm.org/doi/10.1145/2815400.2815409
Coz: finding code that counts with causal profiling by Charlie Curtsinger and Emery D. Berger
This paper presents a novel way to profile multi-threaded programs. Conventional profilers measure how much time a program spends in each of its parts. This is insufficient for multi-threaded programs, because optimizing the parts where the most time is spent might not improve overall performance at all. In some cases it might even make the program slower, due to increased resource contention. To overcome this limitation, the authors introduce "virtual speedups": by slowing down everything except the optimization target, one can estimate the relative effect that speeding up the target would have on the whole program.
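To make the virtual speedup idea concrete, here is a minimal sketch under a deliberately simple assumption: the threads run independently and the program ends when the slowest one finishes. The function names and the thread durations are made up for illustration and are not taken from Coz.

```python
# Toy model (an assumption for illustration, not Coz's code): each thread runs
# independently and the program finishes when the slowest thread does.

def actual_runtime(durations, target, speedup):
    """Runtime if `target` really were `speedup` (a fraction in 0..1) faster."""
    return max(
        d * (1 - speedup) if name == target else d
        for name, d in durations.items()
    )


def virtual_runtime(durations, target, speedup):
    """Estimate the same runtime without changing `target` itself."""
    # Every other thread is paused in proportion to the time the target runs.
    delay = durations[target] * speedup
    measured = max(
        d if name == target else d + delay
        for name, d in durations.items()
    )
    # Subtracting the inserted delay leaves the effect of the virtual speedup.
    return measured - delay


threads = {"a": 6.0, "b": 4.0}  # seconds per thread, made-up numbers
for target in threads:
    print(target,
          actual_runtime(threads, target, 0.5),
          virtual_runtime(threads, target, 0.5))
```

In this toy model both functions agree: making "b" 50% faster leaves the runtime at 6 seconds, because "a" is the bottleneck, while making "a" 50% faster brings it down to 4 seconds.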
They implemented a profiler based on this idea, available on GitHub (https://github.com/plasma-umass/coz). It makes use of Linux's perf system. Instead of instrumenting the code, it collects samples and injects a slowdown every time the optimization target's address is found in the samples. Each thread processes its own samples, and the slowdowns are coordinated through a global slowdown count and a thread-local one. Each time a thread finds a sample containing the target, it increments both the global count and its own thread-local count. A thread whose local count is smaller than the global count sleeps to make up the difference and raises its local count to match.
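The counter scheme can be sketched as a small single-threaded simulation. This is only an illustration of the bookkeeping described above: the class and method names are hypothetical, the real profiler is not written in Python, and an actual sleep is replaced here by adding to a per-thread total.

```python
class DelayCounters:
    """Hypothetical model of the global and thread-local slowdown counts."""

    def __init__(self, n_threads, delay_us):
        self.global_count = 0
        self.local_counts = [0] * n_threads
        self.paused_us = [0] * n_threads  # stands in for time spent sleeping
        self.delay_us = delay_us

    def on_sample(self, tid, hit_target):
        if hit_target:
            # The thread that ran the target is credited instead of delayed.
            self.global_count += 1
            self.local_counts[tid] += 1
        # Any thread that is behind the global count pays the difference.
        owed = self.global_count - self.local_counts[tid]
        if owed > 0:
            self.paused_us[tid] += owed * self.delay_us
            self.local_counts[tid] = self.global_count


counters = DelayCounters(n_threads=3, delay_us=100)
counters.on_sample(0, hit_target=True)   # thread 0 runs the target twice
counters.on_sample(0, hit_target=True)
counters.on_sample(1, hit_target=False)  # threads 1 and 2 catch up
counters.on_sample(2, hit_target=False)
print(counters.paused_us)
```

Incrementing the local count together with the global one is what keeps a thread from being slowed down by its own execution of the target; only the other threads see a gap between the two counts.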
Their evaluation is promising: the profiler has low overhead, and the performance issues they found with it are interesting.