merge
calls back()
on empty vector
The logic in your merge
function has a bug.
if (firstHalf.back() > secondHalf.back() && !firstHalf.empty()) { … }
The first problem is that you access firstHalf.back()
before the check that !firstHalf.empty()
. The checks should be in reversed order.
The second problem is that you're also accessing secondHalf.back()
without checking whether !secondHalf.empty()
.
While the code could be fixed with less changes, I suggest that you look at the problem again: Merging only really makes sense if neither of the halves is empty. So let's break the problem into two parts:
- Merge two non-empty containers.
- Append the remaining elements to the already merged container.
std::vector<float> combined{}; // Merge two non-empty containers. while (!firstHalf.empty() && !secondHalf.empty()) { if (firstHalf.back() > secondHalf.back()) { combined.push_back(firstHalf.back()); firstHalf.pop_back(); } else { combined.push_back(secondHalf.back()); secondHalf.pop_back(); } } // Append the remaining elements to the already merged container. while (!firstHalf.empty()) { combined.push_back(firstHalf.back()); firstHalf.pop_back(); } while (!secondHalf.empty()) { combined.push_back(secondHalf.back()); secondHalf.pop_back(); }
This version is not only correct but also simpler to understand.
Input does not handle excess spaces correctly
If you feed your program with input that puts more than one space between two numbers, your program will crash.
While it is a legitimate choice to require exactly one space between subsequent numbers, it turns out that you can make the behavior more user-friendly and simplify the code at the same time.
std::vector<float> floatVectorInput() { std::vector<float> numbers{}; std::string line{}; std::getline(std::cin, line); std::istringstream iss{line}; float x; while (iss >> x) { numbers.push_back(x); } if (!iss.eof()) { throw std::runtime_error{"Garbage input"}; } return numbers; }
Instead of first splitting into a vector of strings, I'm reading the float
s from the std::istringstream
directly. This will skip over any amount of white-space as desired.
The !iss.eof()
check is there because I want to make sure that we stop because the input is exhausted, not because there was something that is not a floating-point number.
Avoid using namespace std
I know that many C++ tutorials want you to put
using namespace std;
into every file. I don't think that this is good advice and recommend against doing it. Importing namespace
s is a complex operation and you probably don't understand all its implications. For example, what will the following code print with and without the using namespace std;
?
#include <iostream> #include <utility> //using namespace std; void swap(double x, double y) { std::clog << "swap(" << x << ", " << y << ")\n"; } int main() { int a = 1; int b = 2; swap(a, b); swap(3, 4); }
With the offending line commented out, the code will print swap(1, 2)
and swap(3, 4)
as you'd probably expect. However, with using namespace std
, it will only print the second line.
What happened?
By using namespace std
, we've made std::swap
(defined in <utility>
) visible. Since our own swap
takes double
s as parameters, calling it with an argument of type int
is not a perfect match. What the C++ compiler does is adding an implicit conversion from int
to double
. However, if there is also a function that doesn't require this conversion to happen, the compiler will prefer it. It just so happens that std::swap
is a template
that will be a better match in this case. So why is only the first call resolved to std::swap
then? This is because std::swap
takes its arguments as mutable references. A temporary object (like an integer literal in this case) doesn't bind to a mutable reference.
I understand that this is complicated stuff for a beginner and you probably shouldn't have to worry about it at this point. But if you're using namespace std
, you'll have to know it or you won't understand your code.
That said, using namespace std
is also frowned upon in production-quality code (written by people who ought to understand the aforementioned language rules) so you're teaching yourself a bad habit by using it. Just be explicit and prefix symbols from the standard library with std::
. It will tell the reader at a glance where the symbol comes from which makes understanding the code easier for readers of any level of experience.
Be const
correct
Your mergeSort
function doesn't modify its argument (sorts the vector in-place) but actually returns a new, sorted vector. Therefore, make its argument const
.
std::vector<float> mergeSort(const std::vector<float>& inputArray); // ^^^^^
You should do this with any variable that you don't intend to modify.
Use the correct integer types
This comment is actually wrong.
auto round the input array because size() returns int
std::vector::size
is of type std::vector::size_type
which is probably an alias for std::size_t
. In any case, it is an unsigned
integer type. Compiling your code with warnings enabled will warn you about this mixing of signed
and unsigned
types.
Unfortunately, std::size_t
is an unsigned
integer for historic reasons. This meas that you cannot let inidces go below zero which makes the loop conditions a bit more complicated in some places.
Make double
your go-to floating-point type
Unless you have a good reason to use float
, prefer double
for its greater range and precision. On modern hardware, you'll hardly notice a performance difference.
That's not to say there are no valid use-cases for float
. But unless you have such a case, your default-choice should be for double
.
Prefer '\n'
over std::endl
by default
std::endl
does more than outputting a new line. It also flushes the output stream. Which is an expensive operation.
Sometimes, this is just what you want. For example, in this line,
cout << "Array to sort (separate by spaces): " << endl;
you've used it correctly. The text must be visible before the instruction on the subsequent line gets executed or the program might start blocking for user input before the unlucky user was asked to provide some.
However, in all other places in your program, there is no need to flush. And output flushing is slow. It probably doesn't make any measurable difference for your small program but I've seen code where replacing std::endl
with '\n'
literally made the code ten times faster! And as a bonus: it's also less typing.
Consider using auto
type deduction (C++11)
C++11 introduced a great feature: auto
type deduction. This means that if you're declaring a variable and immediately initialize it with some value, you don't have to spell out its type. The compiler will deduce it for you.
So, for example, you could replace
vector<float> sorted = mergeSort(inputArray);
by
auto sorted = mergeSort(inputArray);
and the compiler will figure out for you that sorted
is of type vector<float>
.
Using auto
systematically can help you avoid repeating the same information over and over again and also avoid doing some silly mistakes that happen when you think you know the type.
(Applying the earlier advice, we should use const auto
here, of course.)
Avoid index-based for
loops (C++11)
You're using C-style for
loops throughout your code. While they are not wrong, they're less readable than ideal and allow you to make some avoidable bugs with the index bookkeeping. In some cases, they might also perform worse than ideal.
Since C++11, you can iterate over any standard library container using the following idiom, known as range-based for
loop.
Before:
for (int i = 0; i < stringArray.size(); i++) { array.push_back(stof(stringArray[i])); }
After:
for (auto&& s : stringArray) { array.push_back(stof(s)); }
Notice how I am using auto
type deduction here. The type of s
will be deduced to std::string&
here but you don't necessarily have to worry about it. You can always use auto&&
if you don't care about the actual type. Prefer const auto&
if you don't intend to modify the object, though. Don't use plain auto
as it will make an unnecessary copy.
Prefer standard library algorithms over raw loops (intermediate level)
The previous item notwithstanding, you should try to avoid raw loops in your code at all, if possible. (Search for Sean Parent's great “Better Code” talks and in particular for his “no raw loops” advice.)
For example, the loop from above could become this.
std::transform( std::begin(stringArray), std::end(stringArray), std::back_inserter(array), [](const std::string& s){ return std::stof(s); } );
I understand that this might not be very easy to understand for a beginner and it is perfectly acceptable to write your own loops at this point. But once you become more familiar with C++, standard algorithms are definitely something to look into.
(Taking this further, production code would of course use the standard library's std::stable_sort
algorithm instead of rolling its own merge sort implementation. But implementing algorithms yourself for learning is a good exercise.)
Make the code generic (intermediate level)
Your sorting logic can currently only be used for sorting std::vector<float>
. You could generalize it to work with any container of any type that defines a comparison function.
This is not a defect in your code. If you only have to sort std::vector<float>
, that's fine. In fact, it is better to write specific code that is correct than to write generic code that is flawed. Once you've learned about template
s and move semantics, you can attempt to make your code generic.
Avoid unnecessary copies
I liked this comment in your code.
Reverse merged vectors. Vectors don't have pop_front and I didn't want to use lists.
Indeed, using lists would be a bad idea here (and almost everywhere else, too). However, there is no need to create another vector. For one, you can reverse the elements of a vector in-place. In this case, however, an even better solution would be to use a std::deque
(which has pop_front
) instead of a std:vector
.
Of course, you could also re-work your code such that you don't actually pop the items off the front of your vector but merely iterate over it.
There are some other places where your code makes copies that could be avoided. For example, the splitting into the firstHalf
and secondHalf
vectors isn't necessary. You could instead always pass a reference to the original vector and a pair of indices. Alternatively (and preferably) use iterators.
An expert would probably also try to avoid the allocations for the merged intermediary results and instead use a single working array. But that is probably too complicated for now.
The naming could be improved slightly
Your code is generally very readable but at some places, I'd suggest looking carefully at the chosen names. For example, internal
doesn't really tell me anything useful. More as a matter of taste, I'd avoid names like stringArray
in favor of strings
. Especially since it is a std::vector
and not a std::array
.
Use tools to detect easy bugs
Some of the issues pointed out in this review could easily have been found by tools.
The most important tool is your compiler. Always compile with a high level of warnings enabled and treat them as errors. For GCC and Clang, I recommend you use the -Wall
, -Wextra
, -Werror
and-pedantic
options. There are even more warnings available but these should cover a good part.
If your C++ implementation supports it, build debug builds with a “checked STL”. That is, a modified standard library that performs more checking than required (or even allowed) by the standard. When using GCC, this is really simple. All you have to do is compile with the -D_GLIBCXX_DEBUG
option.
This cannot find all issues, however. So in addition, I recommend that you compile your code with a sanitizer. GCC and Clang both support the -fsanitize=TOOL
flag where TOOL
is the name of the sanitizer to use. I recommend that you use address
and undefined
routinely. Using these switches will instrument your binaries with some run-time checks.
Alternatively – or, preferably, in addition – you can use Valgrind which is a tool to instrument your binaries on the fly.
Instrumentation of your code is only useful, however, if you have appropriate tests that execute the instrumented code. In your case, testing would be easy. Just write a function that calls mergeSort
with a number of interesting inputs. “Interesting” in this case might be: an empty vector, a vector of a single element, a sorted vector, a vector sorted in reverse, a vector with duplicate elements, a random vector, etc. Of course, the test should also verify that the output is sorted correctly. Re-run these tests whenever you make modifications to your code to be alerted of newly introduced bugs immediately.
if(firstHalf.back() > secondHalf.back() && !firstHalf.empty())
will cause a segfault. Also, you keep creating and moving a lot of vectors, perhaps it is better to just keep one scratchpad vector and do the merging on it. Finally, you usually sort a vector by modifying it rather than creating another vector with elements sorted.\$\endgroup\$back()
on provably empty vectors. That is undefined behaviour and just waiting for the right input to crash.\$\endgroup\$