In software, we often use key-value data structures, where each key is unique and maps to a specific value. Common examples include dictionaries in Python, hash maps in Java, and objects in JavaScript. If you combine arrays with key-value data structures, you can represent most data.
In C++, we have two standard key-value data structures: std::map and std::unordered_map. The std::map is typically implemented as a balanced tree (e.g., a red-black tree): it keeps its keys in sorted order and provides logarithmic-time O(log n) lookups, insertions, and deletions. The std::unordered_map is typically implemented as a hash table with unsorted keys, offering average-case constant-time O(1) lookups, insertions, and deletions. In the std::unordered_map, a hash function maps each key to an index in an array of buckets. Each bucket is a container that can hold multiple key-value pairs, typically a linked list. When multiple keys hash to the same bucket (a collision), they are stored in that bucket's linked list.
Quite often, we only need to look at the keys, or look at the values. The C++20 standard makes this convenient through the introduction of ranges (std::ranges::views::keys and std::ranges::views::values). Let us consider two functions using the ‘modern’ functional style. The first function sums the values and the next function counts how many keys (assuming that they are strings) start with a given prefix.
```cpp
template <typename map_type>
auto sum_values(const map_type& map) {
    auto values = map | std::ranges::views::values;
    return std::accumulate(values.begin(), values.end(),
                           typename map_type::mapped_type{});
}

template <typename map_type>
size_t count_keys_with_prefix(const map_type& map, std::string_view prefix) {
    auto keys = map | std::ranges::views::keys;
    return std::ranges::count_if(keys, [prefix](std::string_view key) {
        return key.starts_with(prefix);
    });
}
```
In C++20, you can also write the same functions with conventional loops over the key and value views:

```cpp
template <typename map_type>
auto sum_values_daniel(map_type& map) {
    typename map_type::mapped_type sum{};
    for (const auto& value : map | std::ranges::views::values) {
        sum += value;
    }
    return sum;
}

template <typename map_type>
size_t count_keys_with_prefix_daniel(const map_type& map, std::string_view prefix) {
    size_t count = 0;
    for (const auto& key : map | std::ranges::views::keys) {
        if (key.starts_with(prefix)) {
            ++count;
        }
    }
    return count;
}
```
When you do not have C++20, you can write the same functions using a more conventional style:
```cpp
template <typename map_type>
typename map_type::mapped_type sum_values_cpp11(const map_type& map) {
    typename map_type::mapped_type sum = typename map_type::mapped_type();
    for (typename map_type::const_iterator it = map.begin(); it != map.end(); ++it) {
        sum += it->second;
    }
    return sum;
}

template <typename map_type>
size_t count_keys_with_prefix_cpp11(const map_type& map, const std::string& prefix) {
    size_t count = 0;
    for (typename map_type::const_iterator it = map.begin(); it != map.end(); ++it) {
        const std::string& key = it->first;
        if (key.size() >= prefix.size() && key.compare(0, prefix.size(), prefix) == 0) {
            ++count;
        }
    }
    return count;
}
```
Let us compare the performance. I have generated a large key-value map with a thousand string keys and corresponding integer values. All my keys begin with ‘key_’ and I ask my functions to count how many keys begin with ‘key_’ (that should be all of them). I use both an std::map and an std::unordered_map as the data source. I record how many nanoseconds are spent per key checked. I use two machines: one with an Intel Ice Lake processor running at 3.2 GHz with GCC 12 and LLVM 16 as the compilers, and one with an Apple M2 processor running at 3.5 GHz with LLVM 15 as the compiler. They do not use the same instruction set, so we cannot directly compare the number of instructions retired: we expect an ARM-based system like the Apple M2 to use a few more instructions.
Using an std::map I get the following timings and instruction counts:
function | Apple M2/LLVM15 | Intel Ice Lake/GCC12 | Intel Ice Lake/LLVM16 |
---|---|---|---|
C++20 functional | 2.2 ns | 5.5 ns | 5.8 ns |
C++20 loop | 2.0 ns | 5.5 ns | 5.8 ns |
C++11 | 4.0 ns | 5.5 ns | 5.5 ns |
C++11 with starts_with | 2.0 ns | 5.5 ns | 5.8 ns |
function | Apple M2/LLVM15 | Intel Ice Lake/GCC12 | Intel Ice Lake/LLVM16 |
---|---|---|---|
C++20 functional | 46 instructions | 29 instructions | 49 instructions |
C++20 loop | 41 instructions | 29 instructions | 49 instructions |
C++11 | 58 instructions | 30 instructions | 30 instructions |
C++11 with starts_with | 43 instructions | 30 instructions | 49 instructions |
Using an std::unordered_map, I get the following timings and instruction counts:
function | Apple M2/LLVM15 | Intel Ice Lake/GCC12 | Intel Ice Lake/LLVM16 |
---|---|---|---|
C++20 functional | 1.1 ns | 3.8 ns | 3.8 ns |
C++20 loop | 0.9 ns | 5.4 ns | 3.8 ns |
C++11 | 0.9 ns | 5.4 ns | 3.8 ns |
C++11 with starts_with | 0.9 ns | 5.4 ns | 3.8 ns |
function | Apple M2/LLVM15 | Intel Ice Lake/GCC12 | Intel Ice Lake/LLVM16 |
---|---|---|---|
C++20 functional | 35 instructions | 10 instructions | 30 instructions |
C++20 loop | 30 instructions | 13 instructions | 30 instructions |
C++11 | 26 instructions | 14 instructions | 12 instructions |
C++11 with starts_with | 32 instructions | 14 instructions | 30 instructions |
We find that iterating over an std::map instance is slower than iterating over an std::unordered_map instance.
The compiler can matter as much as the coding style. With GCC 12 on an std::unordered_map, the C++20 functional version (3.8 ns) runs roughly 40% faster than the conventional loops (5.4 ns); on the Apple M2 with LLVM 15 and an std::map, switching the C++11 code to starts_with halves the running time (4.0 ns down to 2.0 ns).
Even though the input is small enough to fit in cache, the Intel processor is not compute-bound on this task: it retires fewer than 2 instructions per cycle (IPC), while the Apple M2 exceeds 4 IPC, highlighting its architectural advantage at hiding the cost of these data access patterns.
Daniel Lemire, "Streamlined iteration: exploring keys and values in C++20," in Daniel Lemire's blog, April 20, 2025, https://lemire.me/blog/2025/04/20/iterating-through-keys-and-values-in-c-with-c20-code/.
The mixing of compilers and CPUs makes it hard to tell how much of the performance difference is due to the compiler (LLVM vs. GCC) and how much is due to the CPU architecture (ARM vs. Intel).
A fairer comparison would use the same version of LLVM on both CPUs.
It might be interesting to see if using std::reduce instead of std::accumulate shaves off any operations.
The inner loop of the “C++11 with starts_with” variant gets faster if you replace the const std::string& parameter with a std::string_view (which is C++17). Apart from instruction order, the generated code is then equivalent to the “C++20 functional” version. In other words, the difference appears to be due to the aliasing rules.