Implementation of merge sort algorithm in C++

Question

This is my implementation of "merge sort" in C++. I'm still a newbie at C++ so:

Have I understood how merge sort works and implemented it in the right way?
Have I used any bad practices?
How could the code be improved upon in terms of efficiency?

Criticism is welcome and appreciated!

#include <iostream> #include <vector> #include <string> #include <sstream> using namespace std; vector<float> merge(vector<float> firstHalf, vector<float> secondHalf){ vector<float> combined; for(int i = firstHalf.size() + secondHalf.size(); i > 0; i--){//merge two vectors if(firstHalf.back() > secondHalf.back() && !firstHalf.empty()){ combined.push_back(firstHalf.back()); firstHalf.pop_back(); }else if(!secondHalf.empty()){ combined.push_back(secondHalf.back()); secondHalf.pop_back(); } } vector<float> revCombined;//reverse merged vectors. Vectors don't have pop_front and I didn't want to use lists. for(int i = 0; i < combined.size(); i++){ revCombined.push_back(combined[combined.size()-i-1]); } return revCombined; } vector<float> mergeSort(vector<float> &inputArray){//for example [9, 8, 1] as input if(inputArray.size() > 1){ vector<float> firstHalf; vector<float> secondHalf; for(int i = 0; i < inputArray.size()/2; i++){//auto round the input array because size() returns int firstHalf.push_back(inputArray[i]); }//first half = [9, 8] for(int i = inputArray.size()/2; i < inputArray.size(); i++){ secondHalf.push_back(inputArray[i]); }//second half = [1] return merge(mergeSort(firstHalf), mergeSort(secondHalf)); } else{ return inputArray; } } vector<string> split(string str, char delimiter) { vector<string> internal; stringstream ss(str); // Turn the string into a stream. string tok; while(getline(ss, tok, delimiter)) { internal.push_back(tok); } return internal; } vector<float> floatVectorInput(){ string inputString; getline(cin, inputString); vector<string> stringArray = split(inputString, ' '); vector<float> array; for(int i = 0; i < stringArray.size(); i++){ array.push_back(stof(stringArray[i])); } return array; } int main(){ cout << "Array to sort (separate by spaces): " << endl; vector<float> inputArray = floatVectorInput(); vector<float> sorted = mergeSort(inputArray); cout << endl << "Sorted Array:" << endl; for(int i = 0; i < sorted.size(); i++){ cout << sorted[i]; if(i == sorted.size()-1){ cout << endl << endl; }else{ cout << ", "; } } return 0; }

Have you tested your code? To me it seems that the check if(firstHalf.back() > secondHalf.back() && !firstHalf.empty()) will cause a segfault. Also, you keep creating and moving a lot of vectors, perhaps it is better to just keep one scratchpad vector and do the merging on it. Finally, you usually sort a vector by modifying it rather than creating another vector with elements sorted. — Raziman T V, CommentedJan 21, 2017 at 17:26
@RazimanT.V. The code seems to work correctly. You can play with it (and various inputs) here: ideone.com/rWbAsl — πάντα ῥεῖ, CommentedJan 21, 2017 at 17:37
The code performs back() on provably empty vectors. That is undefined behaviour and just waiting for the right input to crash. — Raziman T V, CommentedJan 21, 2017 at 17:40
What version of C++ are you using? If it is c++11 or c++14, please add the relevant tags to your question. — 5gon12eder, CommentedJan 21, 2017 at 18:15

Community · Accepted Answer · 2017-05-23 12:40:57Z

`merge` calls `back()` on empty vector

The logic in your merge function has a bug.

if (firstHalf.back() > secondHalf.back() && !firstHalf.empty()) { … }

The first problem is that you access firstHalf.back()before the check that !firstHalf.empty(). The checks should be in reversed order.

The second problem is that you're also accessing secondHalf.back() without checking whether !secondHalf.empty().

While the code could be fixed with less changes, I suggest that you look at the problem again: Merging only really makes sense if neither of the halves is empty. So let's break the problem into two parts:

Merge two non-empty containers.
Append the remaining elements to the already merged container.

std::vector<float> combined{}; // Merge two non-empty containers. while (!firstHalf.empty() && !secondHalf.empty()) { if (firstHalf.back() > secondHalf.back()) { combined.push_back(firstHalf.back()); firstHalf.pop_back(); } else { combined.push_back(secondHalf.back()); secondHalf.pop_back(); } } // Append the remaining elements to the already merged container. while (!firstHalf.empty()) { combined.push_back(firstHalf.back()); firstHalf.pop_back(); } while (!secondHalf.empty()) { combined.push_back(secondHalf.back()); secondHalf.pop_back(); }

This version is not only correct but also simpler to understand.

Input does not handle excess spaces correctly

If you feed your program with input that puts more than one space between two numbers, your program will crash.

While it is a legitimate choice to require exactly one space between subsequent numbers, it turns out that you can make the behavior more user-friendly and simplify the code at the same time.

std::vector<float> floatVectorInput() { std::vector<float> numbers{}; std::string line{}; std::getline(std::cin, line); std::istringstream iss{line}; float x; while (iss >> x) { numbers.push_back(x); } if (!iss.eof()) { throw std::runtime_error{"Garbage input"}; } return numbers; }

Instead of first splitting into a vector of strings, I'm reading the floats from the std::istringstream directly. This will skip over any amount of white-space as desired.

The !iss.eof() check is there because I want to make sure that we stop because the input is exhausted, not because there was something that is not a floating-point number.

Avoid `using namespace std`

I know that many C++ tutorials want you to put

using namespace std;

into every file. I don't think that this is good advice and recommend against doing it. Importing namespaces is a complex operation and you probably don't understand all its implications. For example, what will the following code print with and without the using namespace std;?

#include <iostream> #include <utility> //using namespace std; void swap(double x, double y) { std::clog << "swap(" << x << ", " << y << ")\n"; } int main() { int a = 1; int b = 2; swap(a, b); swap(3, 4); }

With the offending line commented out, the code will print swap(1, 2) and swap(3, 4) as you'd probably expect. However, with using namespace std, it will only print the second line.

What happened?

By using namespace std, we've made std::swap (defined in <utility>) visible. Since our own swap takes doubles as parameters, calling it with an argument of type int is not a perfect match. What the C++ compiler does is adding an implicit conversion from int to double. However, if there is also a function that doesn't require this conversion to happen, the compiler will prefer it. It just so happens that std::swap is a template that will be a better match in this case. So why is only the first call resolved to std::swap then? This is because std::swap takes its arguments as mutable references. A temporary object (like an integer literal in this case) doesn't bind to a mutable reference.

I understand that this is complicated stuff for a beginner and you probably shouldn't have to worry about it at this point. But if you're using namespace std, you'll have to know it or you won't understand your code.

That said, using namespace std is also frowned upon in production-quality code (written by people who ought to understand the aforementioned language rules) so you're teaching yourself a bad habit by using it. Just be explicit and prefix symbols from the standard library with std::. It will tell the reader at a glance where the symbol comes from which makes understanding the code easier for readers of any level of experience.

Be `const` correct

Your mergeSort function doesn't modify its argument (sorts the vector in-place) but actually returns a new, sorted vector. Therefore, make its argument const.

std::vector<float> mergeSort(const std::vector<float>& inputArray); // ^^^^^

You should do this with any variable that you don't intend to modify.

Use the correct integer types

This comment is actually wrong.

auto round the input array because size() returns int

std::vector::size is of type std::vector::size_type which is probably an alias for std::size_t. In any case, it is an unsigned integer type. Compiling your code with warnings enabled will warn you about this mixing of signed and unsigned types.

Unfortunately, std::size_t is an unsigned integer for historic reasons. This meas that you cannot let inidces go below zero which makes the loop conditions a bit more complicated in some places.

Make `double` your go-to floating-point type

Unless you have a good reason to use float, prefer double for its greater range and precision. On modern hardware, you'll hardly notice a performance difference.

That's not to say there are no valid use-cases for float. But unless you have such a case, your default-choice should be for double.

Prefer `'\n'` over `std::endl` by default

std::endl does more than outputting a new line. It also flushes the output stream. Which is an expensive operation.

Sometimes, this is just what you want. For example, in this line,

cout << "Array to sort (separate by spaces): " << endl;

you've used it correctly. The text must be visible before the instruction on the subsequent line gets executed or the program might start blocking for user input before the unlucky user was asked to provide some.

However, in all other places in your program, there is no need to flush. And output flushing is slow. It probably doesn't make any measurable difference for your small program but I've seen code where replacing std::endl with '\n' literally made the code ten times faster! And as a bonus: it's also less typing.

Consider using `auto` type deduction (C++11)

C++11 introduced a great feature: auto type deduction. This means that if you're declaring a variable and immediately initialize it with some value, you don't have to spell out its type. The compiler will deduce it for you.

So, for example, you could replace

vector<float> sorted = mergeSort(inputArray);

by

auto sorted = mergeSort(inputArray);

and the compiler will figure out for you that sorted is of type vector<float>.

Using auto systematically can help you avoid repeating the same information over and over again and also avoid doing some silly mistakes that happen when you think you know the type.

(Applying the earlier advice, we should use const auto here, of course.)

Avoid index-based `for` loops (C++11)

You're using C-style for loops throughout your code. While they are not wrong, they're less readable than ideal and allow you to make some avoidable bugs with the index bookkeeping. In some cases, they might also perform worse than ideal.

Since C++11, you can iterate over any standard library container using the following idiom, known as range-based for loop.

Before:

for (int i = 0; i < stringArray.size(); i++) { array.push_back(stof(stringArray[i])); }

After:

for (auto&& s : stringArray) { array.push_back(stof(s)); }

Notice how I am using auto type deduction here. The type of s will be deduced to std::string& here but you don't necessarily have to worry about it. You can always use auto&& if you don't care about the actual type. Prefer const auto& if you don't intend to modify the object, though. Don't use plain auto as it will make an unnecessary copy.

Prefer standard library algorithms over raw loops (intermediate level)

The previous item notwithstanding, you should try to avoid raw loops in your code at all, if possible. (Search for Sean Parent's great “Better Code” talks and in particular for his “no raw loops” advice.)

For example, the loop from above could become this.

std::transform( std::begin(stringArray), std::end(stringArray), std::back_inserter(array), [](const std::string& s){ return std::stof(s); } );

I understand that this might not be very easy to understand for a beginner and it is perfectly acceptable to write your own loops at this point. But once you become more familiar with C++, standard algorithms are definitely something to look into.

(Taking this further, production code would of course use the standard library's std::stable_sort algorithm instead of rolling its own merge sort implementation. But implementing algorithms yourself for learning is a good exercise.)

Make the code generic (intermediate level)

Your sorting logic can currently only be used for sorting std::vector<float>. You could generalize it to work with any container of any type that defines a comparison function.

This is not a defect in your code. If you only have to sort std::vector<float>, that's fine. In fact, it is better to write specific code that is correct than to write generic code that is flawed. Once you've learned about templates and move semantics, you can attempt to make your code generic.

Avoid unnecessary copies

I liked this comment in your code.

Reverse merged vectors. Vectors don't have pop_front and I didn't want to use lists.

Indeed, using lists would be a bad idea here (and almost everywhere else, too). However, there is no need to create another vector. For one, you can reverse the elements of a vector in-place. In this case, however, an even better solution would be to use a std::deque (which has pop_front) instead of a std:vector.

Of course, you could also re-work your code such that you don't actually pop the items off the front of your vector but merely iterate over it.

There are some other places where your code makes copies that could be avoided. For example, the splitting into the firstHalf and secondHalf vectors isn't necessary. You could instead always pass a reference to the original vector and a pair of indices. Alternatively (and preferably) use iterators.

An expert would probably also try to avoid the allocations for the merged intermediary results and instead use a single working array. But that is probably too complicated for now.

The naming could be improved slightly

Your code is generally very readable but at some places, I'd suggest looking carefully at the chosen names. For example, internal doesn't really tell me anything useful. More as a matter of taste, I'd avoid names like stringArray in favor of strings. Especially since it is a std::vector and not a std::array.

Use tools to detect easy bugs

Some of the issues pointed out in this review could easily have been found by tools.

The most important tool is your compiler. Always compile with a high level of warnings enabled and treat them as errors. For GCC and Clang, I recommend you use the -Wall, -Wextra, -Werror and-pedantic options. There are even more warnings available but these should cover a good part.

If your C++ implementation supports it, build debug builds with a “checked STL”. That is, a modified standard library that performs more checking than required (or even allowed) by the standard. When using GCC, this is really simple. All you have to do is compile with the -D_GLIBCXX_DEBUG option.

This cannot find all issues, however. So in addition, I recommend that you compile your code with a sanitizer. GCC and Clang both support the -fsanitize=TOOL flag where TOOL is the name of the sanitizer to use. I recommend that you use address and undefined routinely. Using these switches will instrument your binaries with some run-time checks.

Alternatively – or, preferably, in addition – you can use Valgrind which is a tool to instrument your binaries on the fly.

Instrumentation of your code is only useful, however, if you have appropriate tests that execute the instrumented code. In your case, testing would be easy. Just write a function that calls mergeSort with a number of interesting inputs. “Interesting” in this case might be: an empty vector, a vector of a single element, a sorted vector, a vector sorted in reverse, a vector with duplicate elements, a random vector, etc. Of course, the test should also verify that the output is sorted correctly. Re-run these tests whenever you make modifications to your code to be alerted of newly introduced bugs immediately.

Wow!! Fantastic! This is the most insightful thing I've read in a while! What a brilliant answer! Thank You! — user2635139, CommentedJan 21, 2017 at 22:38
Here are my improvements so far: github.com/BenjaminThomas1999/Merge-Sort/blob/master/main.cpp — user2635139, CommentedJan 21, 2017 at 22:40

Community · Accepted Answer · 2017-05-23 12:41:02Z

Have I understood how merge sort works and implemented it in the right way?

You understood the algorithm correctly, and your code yields the desired results.

Have I used any bad practices?
How could the code be improved upon in terms of efficiency?

See my further points below please:

1. Don't use `using namespace std;`

While that would work in your particular case, it's considered bad practice. Especially when you move out your code to separate header files.

See more details here please:

Why is “using namespace std;” considered bad practice?

2. Check constraints in order

This code may call undefined behavior, if the constraints aren't checked first (logical boolean arithmetic operations are executed in order):

if(firstHalf.back() > secondHalf.back() && !firstHalf.empty()){

Since there's the possibility calling std::vector::back() with an empty vector, that code should be

if(!firstHalf.empty() && !secondHalf.empty() && firstHalf.back() > secondHalf.back()){

to avoid calling the offensive statements

3. Simplify your data input processing

Your data input function can be simplified a lot. You don't need another split() function to do this. Simply use a std::istringstream to do this:

vector<float> floatVectorInput(){ string inputString; getline(cin, inputString); vector<float> array; std::istringstream iss(inputString); float val; while(iss >> val){ array.push_back(val); } return array; }

4. Prefer loops and dynamically allocated stacks over recursion

Recursive function calls like

return merge(mergeSort(firstHalf), mergeSort(secondHalf));

always are limited to the (OS defined) stack size regarding the call stack depth.

On large input lists this may bail out with a Stack Overflow error.

You can simply replace that with a std::stack<std::pair<std::vector<float>,std::vector<float>>> structure that is handled within a loop.

@RazimanT.V. Ah, the pop_back() stuff, yes. THX, I've been overlooking that. — πάντα ῥεῖ, CommentedJan 21, 2017 at 20:09

Stack Exchange Network

Implementation of merge sort algorithm in C++

2 Answers 2

`merge` calls `back()` on empty vector

Input does not handle excess spaces correctly

Avoid `using namespace std`

Be `const` correct

Use the correct integer types

Make `double` your go-to floating-point type

Prefer `'\n'` over `std::endl` by default

Consider using `auto` type deduction (C++11)

Avoid index-based `for` loops (C++11)

Prefer standard library algorithms over raw loops (intermediate level)

Make the code generic (intermediate level)

Avoid unnecessary copies

The naming could be improved slightly

Use tools to detect easy bugs

1. Don't use `using namespace std;`

2. Check constraints in order

3. Simplify your data input processing

4. Prefer loops and dynamically allocated stacks over recursion

Hot Network Questions

Implementation of merge sort algorithm in C++

2 Answers 2

merge calls back() on empty vector

Input does not handle excess spaces correctly

Avoid using namespace std

Be const correct

Use the correct integer types

Make double your go-to floating-point type

Prefer '\n' over std::endl by default

Consider using auto type deduction (C++11)

Avoid index-based for loops (C++11)

Prefer standard library algorithms over raw loops (intermediate level)

Make the code generic (intermediate level)

Avoid unnecessary copies

The naming could be improved slightly

Use tools to detect easy bugs

1. Don't use using namespace std;

2. Check constraints in order

3. Simplify your data input processing

4. Prefer loops and dynamically allocated stacks over recursion

Related

Hot Network Questions

`merge` calls `back()` on empty vector

Avoid `using namespace std`

Be `const` correct

Make `double` your go-to floating-point type

Prefer `'\n'` over `std::endl` by default

Consider using `auto` type deduction (C++11)

Avoid index-based `for` loops (C++11)

1. Don't use `using namespace std;`