7
\$\begingroup\$

This is a follow-up question for An Updated Multi-dimensional Image Data Structure with Variadic Template Functions in C++. I implemented histogram template function to get histogram of an image in this post.

The experimental implementation

  • histogram template function implementation

    // histogram template function implementation template<class ElementT = std::uint8_t> constexpr static auto histogram(const Image<ElementT>& input) { std::array<std::size_t, std::numeric_limits<ElementT>::max() - std::numeric_limits<ElementT>::lowest() + 1> histogram_output; std::fill(std::begin(histogram_output), std::end(histogram_output), std::size_t{ 0 }); auto image_data = input.getImageData(); for (std::size_t i = 0; i < image_data.size(); ++i) { ++histogram_output[image_data[i]]; } return histogram_output; } 
  • rgb2hsv template function implementation

    // rgb2hsv template function implementation template<typename ElementT, typename OutputT = HSV> requires (std::same_as<ElementT, RGB> || std::same_as<ElementT, RGB_DOUBLE>) constexpr static auto rgb2hsv(const Image<ElementT>& input) { return pixelwiseOperation([](ElementT input) { return rgb2hsv(input); }, input); } // rgb2hsv template function implementation template<class ExPo, typename ElementT, typename OutputT = HSV> requires (std::same_as<ElementT, RGB> || std::same_as<ElementT, RGB_DOUBLE>) && (std::is_execution_policy_v<std::remove_cvref_t<ExPo>>) constexpr static auto rgb2hsv(ExPo execution_policy, const Image<ElementT>& input) { return pixelwiseOperation(execution_policy, [](ElementT input) { return rgb2hsv(input); }, input); } 

The usage of histogram template function:

/* Developed by Jimmy Hu */ #include <chrono> #include "../base_types.h" #include "../basic_functions.h" #include "../image.h" #include "../image_io.h" #include "../image_operations.h" template<class ExPo, class ElementT> requires (std::is_execution_policy_v<std::remove_cvref_t<ExPo>>) constexpr static auto HistogramTest( ExPo execution_policy, const TinyDIP::Image<ElementT>& input, std::ostream& os = std::cout ) { auto hsv_image = TinyDIP::rgb2hsv(execution_policy, input); auto grayscale_image = TinyDIP::im2uint8(TinyDIP::getVplane(hsv_image)); auto start1 = std::chrono::system_clock::now(); auto histogram_result1 = TinyDIP::histogram(grayscale_image); auto end1 = std::chrono::system_clock::now(); std::chrono::duration<double> elapsed_seconds1 = end1 - start1; os << "elapsed time: " << elapsed_seconds1.count() << '\n'; return histogram_result1; } int main() { auto start = std::chrono::system_clock::now(); std::string image_filename = "1.bmp"; auto image_input = TinyDIP::bmp_read(image_filename.c_str(), true); image_input = TinyDIP::copyResizeBicubic(image_input, 3 * image_input.getWidth(), 3 * image_input.getHeight()); auto histogram_result = HistogramTest(std::execution::par, image_input); for (std::size_t i = 0; i < histogram_result.size(); i++) { std::cout << i << " count: " << histogram_result[i] << " / " << image_input.count() << '\n'; } std::cout << "Sum = " << TinyDIP::recursive_reduce(histogram_result, std::size_t{ 0 }) << '\n'; auto end = std::chrono::system_clock::now(); std::chrono::duration<double> elapsed_seconds = end - start; std::time_t end_time = std::chrono::system_clock::to_time_t(end); std::cout << "Computation finished at " << std::ctime(&end_time) << "elapsed time: " << elapsed_seconds.count() << '\n'; return EXIT_SUCCESS; } 

The output of the test code above:

Width of the input image: 454 Height of the input image: 341 Size of the input image(Byte): 464442 elapsed time: 0.0007759 0 count: 1966 / 1393326 1 count: 65 / 1393326 2 count: 72 / 1393326 3 count: 75 / 1393326 4 count: 71 / 1393326 5 count: 49 / 1393326 6 count: 67 / 1393326 7 count: 52 / 1393326 8 count: 78 / 1393326 9 count: 58 / 1393326 10 count: 60 / 1393326 11 count: 58 / 1393326 12 count: 66 / 1393326 13 count: 75 / 1393326 14 count: 86 / 1393326 15 count: 98 / 1393326 16 count: 105 / 1393326 17 count: 139 / 1393326 18 count: 301 / 1393326 19 count: 554 / 1393326 20 count: 1052 / 1393326 21 count: 1551 / 1393326 22 count: 2407 / 1393326 23 count: 3593 / 1393326 24 count: 5618 / 1393326 25 count: 8144 / 1393326 26 count: 10541 / 1393326 27 count: 12335 / 1393326 28 count: 14096 / 1393326 29 count: 15881 / 1393326 30 count: 16950 / 1393326 31 count: 17881 / 1393326 32 count: 18950 / 1393326 33 count: 19432 / 1393326 34 count: 20085 / 1393326 35 count: 20687 / 1393326 36 count: 21367 / 1393326 37 count: 21758 / 1393326 38 count: 22450 / 1393326 39 count: 23288 / 1393326 40 count: 23501 / 1393326 41 count: 23818 / 1393326 42 count: 24173 / 1393326 43 count: 23949 / 1393326 44 count: 23445 / 1393326 45 count: 22823 / 1393326 46 count: 22218 / 1393326 47 count: 21807 / 1393326 48 count: 21137 / 1393326 49 count: 20491 / 1393326 50 count: 19820 / 1393326 51 count: 19644 / 1393326 52 count: 19148 / 1393326 53 count: 18567 / 1393326 54 count: 18190 / 1393326 55 count: 17536 / 1393326 56 count: 16935 / 1393326 57 count: 16640 / 1393326 58 count: 15821 / 1393326 59 count: 15337 / 1393326 60 count: 14813 / 1393326 61 count: 14085 / 1393326 62 count: 13722 / 1393326 63 count: 13101 / 1393326 64 count: 12752 / 1393326 65 count: 12188 / 1393326 66 count: 11973 / 1393326 67 count: 11385 / 1393326 68 count: 11036 / 1393326 69 count: 10681 / 1393326 70 count: 10205 / 1393326 71 count: 9761 / 1393326 72 count: 9574 / 1393326 73 count: 9162 / 1393326 74 count: 9056 / 1393326 75 count: 8700 / 1393326 76 count: 8270 / 1393326 77 count: 8096 / 1393326 78 count: 7671 / 1393326 79 count: 7575 / 1393326 80 count: 7366 / 1393326 81 count: 7262 / 1393326 82 count: 7011 / 1393326 83 count: 6692 / 1393326 84 count: 6533 / 1393326 85 count: 6459 / 1393326 86 count: 5996 / 1393326 87 count: 5783 / 1393326 88 count: 5764 / 1393326 89 count: 5554 / 1393326 90 count: 5413 / 1393326 91 count: 5310 / 1393326 92 count: 5069 / 1393326 93 count: 5021 / 1393326 94 count: 5026 / 1393326 95 count: 4816 / 1393326 96 count: 4788 / 1393326 97 count: 4561 / 1393326 98 count: 4528 / 1393326 99 count: 4296 / 1393326 100 count: 4308 / 1393326 101 count: 4316 / 1393326 102 count: 4123 / 1393326 103 count: 4023 / 1393326 104 count: 3850 / 1393326 105 count: 3873 / 1393326 106 count: 3731 / 1393326 107 count: 3737 / 1393326 108 count: 3685 / 1393326 109 count: 3566 / 1393326 110 count: 3505 / 1393326 111 count: 3462 / 1393326 112 count: 3383 / 1393326 113 count: 3350 / 1393326 114 count: 3264 / 1393326 115 count: 3204 / 1393326 116 count: 3216 / 1393326 117 count: 3187 / 1393326 118 count: 3191 / 1393326 119 count: 3163 / 1393326 120 count: 2903 / 1393326 121 count: 3040 / 1393326 122 count: 2956 / 1393326 123 count: 2899 / 1393326 124 count: 2802 / 1393326 125 count: 2916 / 1393326 126 count: 2833 / 1393326 127 count: 2745 / 1393326 128 count: 2918 / 1393326 129 count: 2843 / 1393326 130 count: 2795 / 1393326 131 count: 2718 / 1393326 132 count: 2758 / 1393326 133 count: 2781 / 1393326 134 count: 2821 / 1393326 135 count: 2755 / 1393326 136 count: 2763 / 1393326 137 count: 2739 / 1393326 138 count: 2770 / 1393326 139 count: 2705 / 1393326 140 count: 2812 / 1393326 141 count: 2733 / 1393326 142 count: 2766 / 1393326 143 count: 2792 / 1393326 144 count: 2790 / 1393326 145 count: 2813 / 1393326 146 count: 2674 / 1393326 147 count: 2655 / 1393326 148 count: 2544 / 1393326 149 count: 2562 / 1393326 150 count: 2542 / 1393326 151 count: 2668 / 1393326 152 count: 2586 / 1393326 153 count: 2521 / 1393326 154 count: 2525 / 1393326 155 count: 2497 / 1393326 156 count: 2503 / 1393326 157 count: 2513 / 1393326 158 count: 2517 / 1393326 159 count: 2406 / 1393326 160 count: 2526 / 1393326 161 count: 2571 / 1393326 162 count: 2531 / 1393326 163 count: 2504 / 1393326 164 count: 2536 / 1393326 165 count: 2454 / 1393326 166 count: 2529 / 1393326 167 count: 2651 / 1393326 168 count: 2530 / 1393326 169 count: 2519 / 1393326 170 count: 2522 / 1393326 171 count: 2600 / 1393326 172 count: 2614 / 1393326 173 count: 2603 / 1393326 174 count: 2748 / 1393326 175 count: 2711 / 1393326 176 count: 2783 / 1393326 177 count: 2877 / 1393326 178 count: 2921 / 1393326 179 count: 2909 / 1393326 180 count: 2814 / 1393326 181 count: 2747 / 1393326 182 count: 2648 / 1393326 183 count: 2615 / 1393326 184 count: 2510 / 1393326 185 count: 2526 / 1393326 186 count: 2363 / 1393326 187 count: 2284 / 1393326 188 count: 2036 / 1393326 189 count: 2022 / 1393326 190 count: 1908 / 1393326 191 count: 1912 / 1393326 192 count: 1920 / 1393326 193 count: 1818 / 1393326 194 count: 1896 / 1393326 195 count: 1977 / 1393326 196 count: 2016 / 1393326 197 count: 2030 / 1393326 198 count: 2004 / 1393326 199 count: 1855 / 1393326 200 count: 1660 / 1393326 201 count: 1485 / 1393326 202 count: 1395 / 1393326 203 count: 1314 / 1393326 204 count: 1278 / 1393326 205 count: 1064 / 1393326 206 count: 899 / 1393326 207 count: 850 / 1393326 208 count: 847 / 1393326 209 count: 824 / 1393326 210 count: 834 / 1393326 211 count: 838 / 1393326 212 count: 839 / 1393326 213 count: 833 / 1393326 214 count: 913 / 1393326 215 count: 844 / 1393326 216 count: 807 / 1393326 217 count: 827 / 1393326 218 count: 837 / 1393326 219 count: 846 / 1393326 220 count: 889 / 1393326 221 count: 863 / 1393326 222 count: 871 / 1393326 223 count: 931 / 1393326 224 count: 909 / 1393326 225 count: 924 / 1393326 226 count: 950 / 1393326 227 count: 878 / 1393326 228 count: 895 / 1393326 229 count: 833 / 1393326 230 count: 909 / 1393326 231 count: 959 / 1393326 232 count: 870 / 1393326 233 count: 943 / 1393326 234 count: 915 / 1393326 235 count: 870 / 1393326 236 count: 897 / 1393326 237 count: 986 / 1393326 238 count: 925 / 1393326 239 count: 1003 / 1393326 240 count: 999 / 1393326 241 count: 1016 / 1393326 242 count: 1003 / 1393326 243 count: 1032 / 1393326 244 count: 1067 / 1393326 245 count: 1062 / 1393326 246 count: 1148 / 1393326 247 count: 1188 / 1393326 248 count: 1311 / 1393326 249 count: 1391 / 1393326 250 count: 1522 / 1393326 251 count: 1704 / 1393326 252 count: 2114 / 1393326 253 count: 2766 / 1393326 254 count: 4315 / 1393326 255 count: 39663 / 1393326 Sum = 1393326 Computation finished at Thu Feb 20 14:09:02 2025 elapsed time: 0.701304 

All suggestions are welcome.

TinyDIP on GitHub

All suggestions are welcome.

The summary information:

\$\endgroup\$
1
  • \$\begingroup\$Both templates rgb2hsv call another function of the same name that is not in the post.\$\endgroup\$CommentedFeb 20 at 7:58

2 Answers 2

7
\$\begingroup\$

You create an array with

std::numeric_limits<ElementT>::max() - std::numeric_limits<ElementT>::lowest() + 1 

elements, but then use the array as if std::numeric_limits<ElementT>::lowest() is 0. This obviously will index out of bounds if ElementT is a signed type and the image has negative values.

If ElementT is a 32-bit type, your histogram will be huge (and probably mostly unused). If it’s a 64-bit type you won’t be able to allocate the array. If it’s a floating-point type, you’ll be requesting a size using a floating-point type, which should be a compiler error (also a much larger value than what fits in size_t.

At least limit your template to 8 and 16-bit unsigned ints. Better would be to create another function for cases where the type is larger, where you decide on bin size and limits and compute which bin each value falls into.

I’m not sure if there ever is a situation where the input is known at compile time, and the compiler can compute the histogram. But it’s cool that a function like this can be constexpr!

\$\endgroup\$
    7
    \$\begingroup\$
    std::fill(std::begin(histogram_output), std::end(histogram_output), std::size_t{ 0 }); 

    A simpler solution is to aggregate-initialize the array (braces) instead of default-initializing it:

    std::array<std::size_t, /* skipped to save space */> histogram_output{}; 
    for (std::size_t i = 0; i < image_data.size(); ++i) { ++histogram_output[image_data[i]]; } 

    This is the simplest histogram algorithm that isn't bad, so it has a good "performance to complexity" ratio. There are some techniques to do it faster, they will introduce more complexity, so whether that's worth it is a trade-off that you'd have to decide. Turbo-Histogram has collected many of those techniques.

    A significant downside of the simple algorithm is as follows. If the same entry of the histogram is incremented again when it has recently been incremented already (as happens not very often for random data, but happens significantly often for various kinds of structured data including non-random images) then the second increment has a dependency on the first increment through the associated store and load. How bad that is depends on microarchitectural details of the CPU you're running the code on, but it generally isn't great and tends to result in significantly lower performance than independent increments.

    The algorithms available in Turbo-Histogram do some different things:

    • Build multiple independent histograms, and sum them in the end. This gives the CPU some independent work to do even if the same value occurs multiple times in a short span.
    • Detect short runs of equal values and count them all in one go (in clever ways that make detection cheap, but that keep runs somewhat short, eg up to 8 bytes). This avoids the worst case of the simple algorithm and turns it into the best case: runs of equal values can be counted much faster than one-at-a-time. But, if runs of equal values are not common enough, trying to find them is not necessarily a win.

    On top of that there are various optimizations such as replacing 16 individual byte loads with one 128-bit load (_mm_loadu_si128) and 16 extract-byte operations (_mm_extract_epi8), which variant is best depends a lot on the specific CPU microarchitecture: the costs of various operations on that microarchitecture and specific quirks. It's different between Intel Haswell and Intel Rocket Lake and so on.

    Of these, the only technique that can easily be agnostic of ElementT is building multiple independent histograms and summing them in the end. Other techniques could be used by having a specialization for the (presumably most common) case that ElementT = std::uint8_t.

    ElementT had better be an unsigned integral type and not too large, otherwise the whole thing breaks down. Perhaps that ought to be enforced.

    \$\endgroup\$
    2
    • \$\begingroup\$I'd guess that on x86, 32-bit or 64-bit scalar loads and unpacks would be the way to go if you want to relieve pressure on the load / AGU ports. e.g. mov eax, [rdi] / movzx ecx, al / movzx edx, ah / shr eax, 16 and repeat. How you get a compiler to emit that, IDK. The load with memcpy sure, but then it's up to the optimizer to use movzx to read from AH instead of doing more shifts and ANDs. Reading AH has extra latency but still good throughput at least on SKL.\$\endgroup\$CommentedFeb 21 at 22:45
    • \$\begingroup\$By comparison, vpextrb is 2 uops per scalar element; a shuffle (port 5) and an XMM->GPR uop (port 0 like movd). Similar on Zen 4: 2 uops with 1/clock throughput. (uops.info)\$\endgroup\$CommentedFeb 21 at 22:47

    Start asking to get answers

    Find the answer to your question by asking.

    Ask question

    Explore related questions

    See similar questions with these tags.