
I have a function that repeatedly encodes Foos to string. I'm currently deciding between two ways to implement this:

Return by value:

    std::string encode(const Foo& foo);

    void important_function() {
        while (1) {
            Foo foo = get_foo();
            std::string encoded = encode(foo);
            save_to_file(encoded);
        }
    }

Use output argument:

    void encode(const Foo& foo, std::string& encoded);

    void important_function() {
        std::string encoded;
        while (1) {
            Foo foo = get_foo();
            encode(foo, encoded);
            save_to_file(encoded);
        }
    }

Advantages of return by value:

  • Cleaner looking.
  • Don't need to clean the string before reusing it.

Advantages of output argument:

  • Doesn't create a new std::string every iteration (and thus keeps the allocated buffer).

I'm currently thinking about this only from a design point of view. I believe that performance won't be an issue.
Am I missing something here?
Is there some other way to get the clean looking code with no extra allocation cost?

Any changes in the code are possible, including changing types, etc.

  • I’d look at rvalues.
    Commented Nov 4, 2020 at 18:21
  • std::string is often implemented as a smart reference anyway. This will boil down to how one passes out a char*. Answer: it makes no difference.
    – Kain0_0
    Commented Nov 18, 2020 at 5:21
  • Did you godbolt your code?
    – Doc Brown
    Commented Nov 18, 2020 at 6:43

4 Answers


Is it important?

It is good to keep those details in mind, but is it really important at this point in your development to know whether a string will be allocated or not, and whether that will be a bottleneck for your application?

If yes, try both and measure. Chances are the difference is either minimal (compared to other algorithmic issues) or nonexistent. If there really is a difference, you'll know which solution to adopt.

If no, go with what's clearer (IMO the first one), and if you stumble upon performance issues later, you can profile your code to see where the real bottlenecks are.
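For instance, a rough micro-benchmark along these lines can give a first impression. Here encode_value and encode_out are made-up stand-ins for the real encoder, and the 100-character payload is an arbitrary assumption; only measuring your actual Foo and encoder settles it:

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <string>

    // Hypothetical stand-ins for the question's encode(); sized to exceed
    // the small-string buffer so allocation actually matters.
    std::string encode_value(int) { return std::string(100, 'x'); }
    void encode_out(int, std::string& s) { s.assign(100, 'x'); }

    int main() {
        const int N = 1000000;
        std::size_t total = 0;  // accumulated so the work isn't optimized away

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i)
            total += encode_value(i).size();  // fresh string every iteration
        auto t1 = std::chrono::steady_clock::now();

        std::string buf;  // hoisted buffer, as in the question's second version
        for (int i = 0; i < N; ++i) {
            encode_out(i, buf);
            total += buf.size();
        }
        auto t2 = std::chrono::steady_clock::now();

        std::chrono::duration<double, std::milli> by_value = t1 - t0;
        std::chrono::duration<double, std::milli> out_param = t2 - t1;
        std::printf("by value: %.2f ms, out param: %.2f ms (checksum %zu)\n",
                    by_value.count(), out_param.count(), total);
    }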

  • One should not needlessly write inefficient code. Clarity is generally sufficient to trump small efficiencies, but those should only be tolerated where doing so actually saves effort. One should care whether the current code is liable to be a bottleneck. If that is sufficiently likely, one has to measure and potentially tinker. If that is sufficiently unlikely, one gets to expend the saved effort where it matters. Be that finding and fixing a bottleneck, adding a feature, the next great thing, or the big party.
    Commented Nov 3, 2020 at 19:49

Am I missing something here?

The as-if rule means that an implementation may treat both cases the same.

In the second case, when encoded doesn't need to grow, the allocator can easily re-use the same bytes. Otherwise, both cases have to allocate a larger block.
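One blunt way to check that on a given implementation is to count heap allocations by replacing the global operator new. This is a hypothetical instrumentation sketch; keep in mind compilers are allowed to elide allocations entirely, so results vary by compiler and flags:

    #include <cstdio>
    #include <cstdlib>
    #include <new>
    #include <string>

    // Count every heap allocation so we can see whether a reused string
    // re-allocates. std::string's default allocator funnels through operator new.
    static std::size_t g_allocs = 0;

    void* operator new(std::size_t n) {
        ++g_allocs;
        if (void* p = std::malloc(n)) return p;
        throw std::bad_alloc();
    }
    void operator delete(void* p) noexcept { std::free(p); }
    void operator delete(void* p, std::size_t) noexcept { std::free(p); }

    int main() {
        std::string reused;
        for (int i = 0; i < 1000; ++i)
            reused.assign(100, 'x');        // same size each time: allocates once
        std::size_t after_reuse = g_allocs;

        for (int i = 0; i < 1000; ++i) {
            std::string fresh(100, 'x');    // fresh string: allocates every time
            (void)fresh;
        }
        std::printf("reused: %zu allocations, fresh: %zu allocations\n",
                    after_reuse, g_allocs - after_reuse);
    }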

I believe that performance won't be an issue.

Is there some other way to get the clean looking code with no extra allocation cost?

If performance won't be an issue, don't worry yourself over short-lived allocations.

  • It may, but is only likely to if it inlines the call. And the as-if rule alone won't do it; there are special rules for allocation involved. Will it? That would need to be tested, after measuring that it really matters. As you re-iterated, efficiency is not an issue anyway.
    Commented Nov 3, 2020 at 11:07
  • btw I've tried this in the godbolt compiler explorer and I can't really tell which one is better :) godbolt.org/z/dc8f4d
    – cube
    Commented Nov 3, 2020 at 15:07
  • @cube I suggest at least encoding something breaking the SBO threshold. Below that, re-using could not possibly help.
    Commented Nov 3, 2020 at 23:41

Unless you're using a really old compiler, or working really hard at turning off all possible optimization, returning the value will normally be at least as efficient, and sometimes (often?) more efficient.

C++ has allowed what are called Return Value Optimization (RVO) and Named Return Value Optimization (NRVO) since it was first standardized in 1998 (and quite a while before, though what was or wasn't allowed was a bit more nebulous before the standard).

RVO/NRVO say that when you return a value like this, the copy may be elided even if the copy constructor has observable side effects. That may not seem like much, but the intent (and actual result) is that when you return a value that would require copy construction during the return, that copy construction will almost always be optimized away. Instead, the compiler basically creates the returned value that the caller will see, passes a reference to that object to the function as a hidden parameter, and the function just constructs and (if necessary) manipulates that object via the reference.
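A tiny sketch makes the elision visible (Noisy is a made-up type; since C++17 the elision for a returned prvalue like this is guaranteed, and earlier compilers virtually always performed it anyway):

    #include <iostream>

    struct Noisy {
        Noisy() { std::cout << "default ctor\n"; }
        Noisy(const Noisy&) { std::cout << "copy ctor\n"; }
    };

    // The temporary is constructed directly in the caller's storage.
    Noisy make() { return Noisy(); }

    int main() {
        Noisy n = make();  // prints "default ctor" only; no copy happens
        (void)n;
    }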

So, let's put a concrete example to the test by compiling two bits of code and looking at the code they produce:

    #include <string>

    std::string encode(int i) {
        return std::string(i, ' ');
    }

    void encode(int i, std::string &s) {
        s = std::string(i, ' ');
    }

The first produces this code:

    encode[abi:cxx11](int):                 # @encode[abi:cxx11](int)
            push    rbx
            mov     rbx, rdi
            movsxd  rsi, esi
            lea     rax, [rdi + 16]
            mov     qword ptr [rdi], rax
            mov     edx, 32
            call    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct(unsigned long, char)
            mov     rax, rbx
            pop     rbx
            ret

This was compiled with Clang, but gcc produces nearly identical code. MSVC produces slightly different code, but the three have one major characteristic in common: returning the string doesn't involve copying with any of them.

Here's the code from the second version (this time compiled with gcc, but again, Clang is nearly identical, and MSVC fairly similar as well):

    encode(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&): # @encode(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)
            push    r15
            push    r14
            push    rbx
            sub     rsp, 32
            mov     rbx, rsi
            movsxd  rsi, edi
            lea     r15, [rsp + 16]
            mov     qword ptr [rsp], r15
            mov     r14, rsp
            mov     rdi, r14
            mov     edx, 32
            call    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct(unsigned long, char)
            mov     rsi, qword ptr [rsp]
            cmp     rsi, r15
            je      .LBB1_1
            lea     rdx, [rbx + 16]
            mov     rdi, qword ptr [rbx]
            mov     rcx, qword ptr [rbx + 16]
            xor     eax, eax
            cmp     rdi, rdx
            cmovne  rax, rdi
            mov     qword ptr [rbx], rsi
            movups  xmm0, xmmword ptr [rsp + 8]
            movups  xmmword ptr [rbx + 8], xmm0
            test    rax, rax
            je      .LBB1_10
            mov     qword ptr [rsp], rax
            mov     qword ptr [rsp + 16], rcx
            jmp     .LBB1_11
    .LBB1_1:
            cmp     r14, rbx
            je      .LBB1_2
            mov     rdx, qword ptr [rsp + 8]
            test    rdx, rdx
            je      .LBB1_7
            mov     rdi, qword ptr [rbx]
            cmp     rdx, 1
            jne     .LBB1_6
            mov     al, byte ptr [rsi]
            mov     byte ptr [rdi], al
            jmp     .LBB1_7
    .LBB1_10:
            mov     qword ptr [rsp], r15
            mov     rax, r15
            jmp     .LBB1_11
    .LBB1_6:
            call    memcpy
    .LBB1_7:
            mov     rax, qword ptr [rsp + 8]
            mov     qword ptr [rbx + 8], rax
            mov     rcx, qword ptr [rbx]
            mov     byte ptr [rcx + rax], 0
            mov     rax, qword ptr [rsp]
    .LBB1_11:
            mov     qword ptr [rsp + 8], 0
            mov     byte ptr [rax], 0
            mov     rdi, qword ptr [rsp]
            cmp     rdi, r15
            je      .LBB1_13
            call    operator delete(void*)
    .LBB1_13:
            add     rsp, 32
            pop     rbx
            pop     r14
            pop     r15
            ret
    .LBB1_2:
            mov     rax, rsi
            jmp     .LBB1_11

This doesn't do any copying either, but as you can see, it is just a tad longer and more complex...

Here's a link to the code on Godbolt in case you want to play with different compilers, optimization flags, etc.: https://godbolt.org/z/vGc6Wx


If your strings do vary wildly in size and often exceed the SBO size (typically around 16 bytes: sizeof(std::string) on 64-bit architectures is 32 bytes on MSVC, GCC, and Clang last time I checked), then you might get a bit more leverage out of the reference output parameter, at the cost of purity (an enormous cost in my opinion, but one you might need to pay in response to measurements), by calling clear on a string object hoisted out of the loop as in your second example (sketched below).
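A minimal sketch of that hoisted-buffer pattern, reusing the question's declarations and assuming a hypothetical append-style encoder:

    #include <string>

    struct Foo { /* ... */ };                       // stand-in for the question's Foo
    Foo get_foo();                                  // the question's source of Foos
    void save_to_file(const std::string&);          // the question's sink
    void encode_append(const Foo&, std::string&);   // hypothetical: appends into out

    void important_function() {
        std::string encoded;              // hoisted: capacity survives iterations
        while (true) {
            Foo foo = get_foo();
            encoded.clear();              // length -> 0, capacity retained
            encode_append(foo, encoded);
            save_to_file(encoded);
        }
    }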

It's kind of unfortunate that std::string uses such a small buffer for its SBO/SSO. But it's a balancing act, because std::string already somewhat sucks as a key in a hash map, where the stride is a whopping 32 bytes even with a measly 16 or so bytes devoted to its small buffer. It would suck way more with a bigger buffer for such use cases. Really we either need two string types for optimal efficiency (one for stack-related purposes, another for the heap), or some fancy compile-time code generation and branching mechanism which can detect whether or not to use SSO/SBO depending on whether the string's lifetime is pinned to the LIFO nature of the stack (including when it's a member of some UDT). With backward-compatibility concerns and absent a way to distinguish these cases, I can understand why the standard library vendors chose such a teeny size for the SBO.

I don't know how counter-productive you want to be, but we use our own SBO-optimized string type with a whopping 256 bytes for its small buffer, similar to what C programmers often do but without the buffer-overrun dangers in cases where the string exceeds 255 characters. We don't use std::string, and still don't find any reason to do so (actually, in some cases, even fewer reasons now with the SSO/SBO). If a heap allocation is incurred in those cases that require more than 256 bytes, it's typically quite rare and trivial in time in our tuned cases. But of course, that means we have to be careful not to store these in containers, since they'd blow up memory use and cache misses outside of contexts that just involve the stack. We have a whole separate dynamic string type, along with interned strings, for cases where strings are stored outside the stack.

Personally, I'd favor your top version though, no matter what the cost, until I measured it. Functional purity/referential transparency is such a desirable property with so many cores nowadays on your average desktop. If you're concerned about it, I'd be hugging a profiler right now and running it over and over on some tests like a maniac (I must admit I spend a whole lot of time on this, but at least it's less time than pondering). That's at least more productive than guessing about it. Let the profiler answer your design questions in the most critical execution paths. Guessing means there's a probability you might guess wrong and have to incur costly changes to the design.

Almost certainly the second version you have is going to be more efficient unless all your strings fit into the SBO size, but the question is whether the efficiency gain is big enough to be worth sacrificing things like functional purity and the ability to reason about thread-safety. Move ctors won't help as much, BTW, for anyone who thought of that: SBOs aren't so friendly with move ctors. We can swap the pointers in the heap-allocated cases, but for the SBO/SSO cases we still have to deep copy the small buffer, so for small strings a move is no cheaper than a plain copy. If you're seriously in doubt, you can always have both versions (pure and impure):

    void encode(const Foo& foo, std::string& encoded) {
        // do the actual encoding of foo
    }

    std::string encode(const Foo& foo) {
        std::string str;
        encode(foo, str);
        return str;
    }

... and you can probably make the second version a function template. Then you leave yourself some slack to optimize in response to any hotspots that crop up, by transforming code to your second version.

std::vector also has this problem in stack-related cases, on a larger scale, since it doesn't use an SBO/SSO at all (not even a really small buffer) when we're repeatedly creating teeny ones over and over in a large loop only to discard them. Actually, it's weird to me that the standard library authors prioritized small buffer optimizations for std::string over std::vector, since std::vector is probably not used that often as a key in an associative container. It was never efficient for containing a boatload of tiny sequences, so I think it should have been the priority for small buffer optimizations over strings. The legacy associated with std::string makes it much more difficult to optimize with SBOs than std::vector, because only an idiot would store like a million std::vector instances in a container. But strings are something people might actually store in such abundance, and small buffer optimizations can actually degrade, rather than improve, performance in such cases.
