There are two questions here:
- What is the cost of a function call?
- Is inlining a function worth it?
So first of all, a function call has a cost: saving registers, setting up a frame, two jumps, loading registers, yada yada. Not much, but also not zero. Depending on the platform it can be higher or lower, but on the platforms I have experience with (x64 + aarch64 and their 32-bit counterparts) the cost is very, very small. In fact negligible. In fact even virtual calls are often negligible. So why do people bother with devirtualization, you may ask?
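To make the static-vs-virtual distinction concrete, here is a minimal Rust sketch (the trait and function names are mine, not from any real benchmark) contrasting the two call flavors compilers treat differently:

```rust
trait Check {
    fn cond(&self, x: i32) -> bool;
}

struct Positive;

impl Check for Positive {
    fn cond(&self, x: i32) -> bool {
        x > 0
    }
}

// Static dispatch: monomorphized per concrete type, so `cond`
// is a direct call the compiler can easily inline.
fn count_static<C: Check>(c: &C, limit: i32) -> i32 {
    (0..limit).filter(|&i| c.cond(i)).count() as i32
}

// Dynamic (virtual) dispatch: an indirect call through a vtable,
// which the compiler usually cannot inline without devirtualization.
fn count_dyn(c: &dyn Check, limit: i32) -> i32 {
    (0..limit).filter(|&i| c.cond(i)).count() as i32
}

fn main() {
    let p = Positive;
    println!("{}", count_static(&p, 10)); // 9
    println!("{}", count_dyn(&p, 10)); // 9
}
```

Both functions compute the same thing; the difference is only in what the optimizer is allowed to see through.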
The true power of inlining comes from the fact that after inlining the compiler is capable of applying more optimizations. So for example, say we have this code (I'll be using Rust here):

```rust
fn cond(x: i32) -> bool; // body not shown

fn calc(limit: i32) -> i32 {
    let mut sum = 0;
    for i in 0..limit {
        if cond(i) {
            sum += 1;
        }
    }
    return sum;
}
```
(side note: I know that in Rust one may omit the `return` keyword. I'm using it deliberately to make it more readable for those who do not know Rust)
Without inlining this code is at least O(limit), regardless of what `cond` does (this function can only increase complexity, but it cannot decrease it). But if, say,
```rust
fn cond(i: i32) -> bool {
    return i < 0;
}
```
is inlined, then a clever compiler will notice that this condition can never be satisfied in our `for` loop, so it is dead code. Therefore the entire loop can be eliminated (since it doesn't have any observable effect), and a smart compiler will optimize our `calc`
to
```rust
fn calc(limit: i32) -> i32 {
    return 0;
}
```
which is O(1). Without inlining, such an optimization wouldn't be possible.
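Putting the pieces together, here is a self-contained version of that example (you can paste it into the Compiler Explorer and confirm that in release mode the loop is gone from the generated assembly):

```rust
fn cond(i: i32) -> bool {
    return i < 0;
}

fn calc(limit: i32) -> i32 {
    let mut sum = 0;
    // With `cond` inlined, the optimizer sees `i < 0` can never
    // hold for `i` in `0..limit`, and deletes the whole loop.
    for i in 0..limit {
        if cond(i) {
            sum += 1;
        }
    }
    return sum;
}

fn main() {
    println!("{}", calc(1000000)); // 0
}
```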
That being said, to be honest, in reality you will rarely see this kind of drastic performance improvement, because people rarely write code like the example above.
For a concrete example I've created the following code in Rust (with the `rand = "0.8.5"` crate):
```rust
use std::{ops::Range, time::Instant};
use rand::{self, Rng};

#[inline(always)]
fn key(x: &i32) -> i32 {
    return -x;
}

fn main() {
    // Build random array...
    const AVG_ARR_RANGE: usize = 10000000;
    const ARR_RANGE: Range<usize> = (AVG_ARR_RANGE - 1000)..(AVG_ARR_RANGE + 1000);
    let mut rng = rand::thread_rng();
    let arr_len = rng.gen_range(ARR_RANGE);
    let mut arr = Vec::with_capacity(arr_len);
    for _ in 0..arr_len {
        let value: i32 = rng.gen_range(0..(AVG_ARR_RANGE as i32));
        arr.push(value);
    }
    // ...and measure the time it takes to sort it.
    let now = Instant::now();
    {
        arr.sort_by_key(key);
    }
    let elapsed = now.elapsed();
    println!("Elapsed: {:.2?}", elapsed);
}
```
which just sorts a random array by the `key` function. I then ran it on my M2 Mac in release mode, in two variants: `#[inline(always)]` and `#[inline(never)]` over the `key` function. The result is that `#[inline(always)]` improves performance by ~5%. Not bad, indeed.
So the obvious candidate for such optimization is when an algorithm accepts a callback, and that callback is going to be called in a tight loop. Then whether this callback is inlinable or not may matter.
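As a sketch of that situation (function names are mine): a generic parameter lets the compiler specialize the loop for each callback and inline it, while a plain function pointer usually stays an indirect call:

```rust
// Generic: monomorphized per closure type, so the callback body
// can be inlined right into the loop.
fn count_matching<F: Fn(i32) -> bool>(limit: i32, pred: F) -> i32 {
    let mut sum = 0;
    for i in 0..limit {
        if pred(i) {
            sum += 1;
        }
    }
    sum
}

// Function pointer: an indirect call the optimizer often cannot
// see through, so the callback typically stays out-of-line.
fn count_matching_ptr(limit: i32, pred: fn(i32) -> bool) -> i32 {
    let mut sum = 0;
    for i in 0..limit {
        if pred(i) {
            sum += 1;
        }
    }
    sum
}

fn main() {
    println!("{}", count_matching(100, |i| i % 2 == 0)); // 50
    println!("{}", count_matching_ptr(100, |i| i % 2 == 0)); // 50
}
```

This is also why `sort_by_key` in the benchmark above can benefit from inlining at all: the key function's type is a generic parameter of the sort.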
That being said, there are certain cases where inlining not only won't bring performance benefits, but will in fact degrade performance. Why? Because life is hard and CPUs are complicated. Consider this simple code:
```rust
fn calc(cond: bool) -> i32 {
    if cond {
        return second_function();
    } else {
        return 0;
    }
}
```
Now if you don't inline `second_function` and `cond` is rarely true, then execution will go straight to `return 0;`. What happens if we inline the nested call? It sounds like nothing really, but that's not true. The body of `calc` becomes bigger. And if by chance the hot code no longer fits in the CPU's instruction cache, then suddenly calling this function may result in additional memory loads (of instructions) and performance degrades.
And in fact this performance degradation, purely due to inlining, has been observed multiple times; see for example: https://stackoverflow.com/questions/24452525/is-there-an-actual-example-where-inline-is-detrimental-to-the-performance-of-a-c
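If you hit such a case, Rust lets you push the rarely taken branch out of the hot path explicitly. A sketch (the attribute placement is the point; the function bodies are made up):

```rust
// Hint that this path is rarely taken and must not be inlined
// into its caller, keeping the caller's hot path small.
#[cold]
#[inline(never)]
fn second_function() -> i32 {
    // imagine something big and rarely needed here
    42
}

fn calc(cond: bool) -> i32 {
    if cond {
        return second_function();
    }
    return 0;
}

fn main() {
    println!("{}", calc(false)); // 0
    println!("{}", calc(true)); // 42
}
```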
The moral of this story is: always measure.
Even in Python.
Forget about inlining in the context of Python; that technique is irrelevant there. Python is not optimizable the way statically typed languages are. It is just too dynamic. Moreover, Python's bad performance is vastly dominated by things like the Global Interpreter Lock and the fact that most objects are heap allocated. The CPU cache also doesn't do much in Python (to the point of being irrelevant), since every read in Python updates shared reference counters. Add many, many other performance hits, and they accumulate into performance that is worse by hundreds of times compared to, say, C. Against such a huge wave of bottlenecks, those small function calls are invisible.
Of course, the general "always measure" advice still applies to Python. If you need to squeeze performance out of the CPU, you can always use native libraries like numpy. But if I were to optimize pure Python, I would focus on algorithms, and I wouldn't bother with micro-optimizations like inlining.
A classic C++ illustration of the same effect is `std::sort` (which takes the sort predicate as a function object, gets specialized on its type, and thus gets it inlined) compared to `qsort` (which gets a function pointer, and thus dynamic dispatch without inlining).