There are two questions here:
- What is the cost of a function call?
- Is inlining a function worth it?
So first of all, a function call has a cost: saving registers, setting up a frame, two jumps, loading registers, yada yada. Not much, but also not zero. Depending on the platform it can be higher or lower, but on the platforms I have experience with (x64 + aarch64 and their 32-bit counterparts) the cost is very, very small. In fact negligible. In fact even virtual calls are often negligible. So why do people bother with devirtualization, you may ask?
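To make the static-vs-virtual distinction concrete, here is a minimal Rust sketch (the trait and function names are mine, not from any real benchmark) contrasting the two call flavors compilers treat differently:

```rust
trait Check {
    fn cond(&self, x: i32) -> bool;
}

struct Positive;

impl Check for Positive {
    fn cond(&self, x: i32) -> bool {
        x > 0
    }
}

// Static dispatch: monomorphized per concrete type, so `cond`
// is a direct call the compiler can easily inline.
fn count_static<C: Check>(c: &C, limit: i32) -> i32 {
    (0..limit).filter(|&i| c.cond(i)).count() as i32
}

// Dynamic (virtual) dispatch: an indirect call through a vtable,
// which the compiler usually cannot inline without devirtualization.
fn count_dyn(c: &dyn Check, limit: i32) -> i32 {
    (0..limit).filter(|&i| c.cond(i)).count() as i32
}

fn main() {
    let p = Positive;
    println!("{}", count_static(&p, 10)); // 9
    println!("{}", count_dyn(&p, 10)); // 9
}
```

Both functions compute the same thing; the difference is only in what the optimizer is allowed to see through.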
The true power of inlining comes from the fact that after inlining the compiler is capable of applying more optimizations. So for example, say we have this code (I'll be using Rust here):

```rust
fn cond(x: i32) -> bool; // body not shown

fn calc(limit: i32) -> i32 {
    let mut sum = 0;
    for i in 0..limit {
        if cond(i) {
            sum += 1;
        }
    }
    return sum;
}
```
(side note: I know that in Rust one may omit the `return` keyword. I'm using it deliberately to make it more readable for those who do not know Rust)
Without inlining this code is at least O(limit), regardless of what `cond` does (this function can only increase complexity, but it cannot decrease it). But if, say,
```rust
fn cond(i: i32) -> bool {
    return i < 0;
}
```
is inlined, then a clever compiler will notice that this condition can never be satisfied in our `for` loop, so it is dead code. Therefore the entire loop can be eliminated (since it doesn't have any observable effect), and a smart compiler will optimize our `calc`
to
```rust
fn calc(limit: i32) -> i32 {
    return 0;
}
```
which is O(1). Without inlining, such an optimization wouldn't be possible.
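Putting the pieces together, here is a self-contained version of that example (you can paste it into the Compiler Explorer and confirm that in release mode the loop is gone from the generated assembly):

```rust
fn cond(i: i32) -> bool {
    return i < 0;
}

fn calc(limit: i32) -> i32 {
    let mut sum = 0;
    // With `cond` inlined, the optimizer sees `i < 0` can never
    // hold for `i` in `0..limit`, and deletes the whole loop.
    for i in 0..limit {
        if cond(i) {
            sum += 1;
        }
    }
    return sum;
}

fn main() {
    println!("{}", calc(1000000)); // 0
}
```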
That being said, to be honest, in reality you will rarely see this kind of drastic performance improvement, because people rarely write code like the example above.
For a concrete example I've created the following code in Rust (with the `rand = "0.8.5"` crate):
```rust
use std::{ops::Range, time::Instant};
use rand::{self, Rng};

#[inline(always)]
fn key(x: &i32) -> i32 {
    return -x;
}

fn main() {
    // Build random array...
    const AVG_ARR_RANGE: usize = 10000000;
    const ARR_RANGE: Range<usize> = (AVG_ARR_RANGE - 1000)..(AVG_ARR_RANGE + 1000);
    let mut rng = rand::thread_rng();
    let arr_len = rng.gen_range(ARR_RANGE);
    let mut arr = Vec::with_capacity(arr_len);
    for _ in 0..arr_len {
        let value: i32 = rng.gen_range(0..(AVG_ARR_RANGE as i32));
        arr.push(value);
    }
    // ...and measure the time it takes to sort it.
    let now = Instant::now();
    {
        arr.sort_by_key(key);
    }
    let elapsed = now.elapsed();
    println!("Elapsed: {:.2?}", elapsed);
}
```
which just sorts a random array by the `key` function. I then ran it on my M2 Mac in release mode, in two variants: `#[inline(always)]` and `#[inline(never)]` over the `key` function. The result is that `#[inline(always)]` improves performance by ~5%. Not bad, indeed.
So the obvious candidate for such optimization is when an algorithm accepts a callback, and that callback is going to be called in a tight loop. Then whether this callback is inlinable or not may matter.
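As a sketch of that situation (function names are mine): a generic parameter lets the compiler specialize the loop for each callback and inline it, while a plain function pointer usually stays an indirect call:

```rust
// Generic: monomorphized per closure type, so the callback body
// can be inlined right into the loop.
fn count_matching<F: Fn(i32) -> bool>(limit: i32, pred: F) -> i32 {
    let mut sum = 0;
    for i in 0..limit {
        if pred(i) {
            sum += 1;
        }
    }
    sum
}

// Function pointer: an indirect call the optimizer often cannot
// see through, so the callback typically stays out-of-line.
fn count_matching_ptr(limit: i32, pred: fn(i32) -> bool) -> i32 {
    let mut sum = 0;
    for i in 0..limit {
        if pred(i) {
            sum += 1;
        }
    }
    sum
}

fn main() {
    println!("{}", count_matching(100, |i| i % 2 == 0)); // 50
    println!("{}", count_matching_ptr(100, |i| i % 2 == 0)); // 50
}
```

This is also why `sort_by_key` in the benchmark above can benefit from inlining at all: the key function's type is a generic parameter of the sort.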
That being said, there are certain cases where inlining not only won't bring performance benefits, but will in fact degrade performance. Why? Because life is hard and CPUs are complicated. Consider this simple code:
```rust
fn calc(cond: bool) -> i32 {
    if cond {
        return second_function();
    } else {
        return 0;
    }
}
```
Now if you don't inline `second_function` and `cond` is rarely true, then execution will go straight to `return 0;`. What happens if we inline the nested call? It sounds like nothing really, but that's not true. The body of `calc` becomes bigger. And if by chance the hot code no longer fits in the CPU's instruction cache, then suddenly calling this function may result in additional memory loads (of instructions) and performance degrades.
And in fact this performance degradation, purely due to inlining, has been observed multiple times; see for example: https://stackoverflow.com/questions/24452525/is-there-an-actual-example-where-inline-is-detrimental-to-the-performance-of-a-c
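If you hit such a case, Rust lets you push the rarely taken branch out of the hot path explicitly. A sketch (the attribute placement is the point; the function bodies are made up):

```rust
// Hint that this path is rarely taken and must not be inlined
// into its caller, keeping the caller's hot path small.
#[cold]
#[inline(never)]
fn second_function() -> i32 {
    // imagine something big and rarely needed here
    42
}

fn calc(cond: bool) -> i32 {
    if cond {
        return second_function();
    }
    return 0;
}

fn main() {
    println!("{}", calc(false)); // 0
    println!("{}", calc(true)); // 42
}
```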
The moral of this story is: always measure.
Even in Python.
Forget about inlining in the context of Python; that technique is irrelevant there. Python is not optimizable the way statically typed languages are. It is just too dynamic. Moreover, Python's bad performance is vastly dominated by things like the Global Interpreter Lock and the fact that most objects are heap allocated. The CPU cache also doesn't do much in Python (to the point of being irrelevant), since every read in Python updates shared reference counters. Add many, many other performance hits, and they accumulate into performance that is worse by hundreds of times compared to, say, C. Against such a huge wave of bottlenecks, those small function calls are invisible.
Of course, the general "always measure" advice still applies to Python. If you need to squeeze performance out of the CPU, you can always use native libraries like numpy. But if I were to optimize pure Python, I would focus on algorithms, and I wouldn't bother with micro-optimizations like inlining.
A classic C++ illustration of the same effect is `std::sort` (which takes the sort predicate as a function object, gets specialized on its type, and thus gets it inlined) compared to `qsort` (which gets a function pointer, and thus dynamic dispatch without inlining).