Basic C++ implementation of linux wc command

Question

I am working on the Build Your Own wc Tool coding challenge in C++. I have a working implementation that seems to match the output of the wc command for different inputs. I am looking for feedback on

The code logic as well as C++ best practices for structuring the code.
Making it easier to test the correctness of this code with different inputs.

Dependencies

CLI11: A command-line parser

Any suggestions are appreciated since I am relatively new to C++.

#include <CLI/CLI.hpp> #include <iostream> #include <fstream> #include <string> #include <locale> #include <cwchar> struct Text_Counts { int characters{0}, lines{0}, words{0}; }; struct Text_Count_Flags { bool characters{false}, lines{false}, words{false}, multibyte_characters{false}; bool no_flags_set() { return !(characters || lines || words || multibyte_characters); } }; std::string counts_to_string(Text_Counts counts, const Text_Count_Flags &count_flags) { std::string output_string; if (count_flags.lines) { output_string += std::to_string(counts.lines) + " "; } if (count_flags.words) { output_string += std::to_string(counts.words) + " "; } if (count_flags.characters || count_flags.multibyte_characters) { output_string += std::to_string(counts.characters) + " "; } return output_string; } std::istream &get_input_stream(const std::string &filepath) { if (!filepath.empty()) { static std::ifstream input_file(filepath); if (!input_file) { throw std::runtime_error("Could not open file: " + filepath); } return input_file; } else { // return stdin stream if filepath is empty return std::cin; } } Text_Counts get_text_counts(std::istream &input_stream, const Text_Count_Flags &count_flags) { std::mbstate_t state = std::mbstate_t(); // initial state for mbrtowc Text_Counts counts; std::string line, word; // Iterate through each line of input stream while (std::getline(input_stream, line)) { if (count_flags.lines) { ++counts.lines; } if (count_flags.characters || count_flags.multibyte_characters) { if (count_flags.multibyte_characters) { // Count multi-byte characters const char *ptr = line.c_str(); const char *end = ptr + std::strlen(ptr); int len; wchar_t wc; int index = 0; while ((len = std::mbrtowc(&wc, ptr, end - ptr, &state)) > 0) { ptr += len; ++counts.characters; } ++counts.characters; } else { counts.characters += line.length() + 1; } } if (count_flags.words) { std::stringstream ss(line); while (ss >> word) { ++counts.words; } } } return counts; } int main(int argc, char *argv[]) { std::setlocale(LC_ALL, ""); // For multi-byte char counting CLI::App app{"wc command to count characters, word and lines"}; argv = app.ensure_utf8(argv); Text_Count_Flags count_flags; app.add_flag("-c", count_flags.characters, "Count Characters")->take_last(); app.add_flag("-m", count_flags.multibyte_characters, "Count Multi-Byte Characters")->take_last(); app.add_flag("-l", count_flags.lines, "Count Lines")->take_last(); app.add_flag("-w", count_flags.words, "Count Words")->take_last(); std::string filepath; app.add_option("filepath", filepath, "Input file path"); CLI11_PARSE(app, argc, argv); // If none of the flags are passed, use all flags if (count_flags.no_flags_set()) { count_flags.characters = count_flags.words = count_flags.lines = true; } std::istream &input_stream = get_input_stream(filepath); Text_Counts counts = get_text_counts(input_stream, count_flags); std::string output_string = counts_to_string(counts, count_flags) + filepath; std::cout << output_string; return 0; }

This may not be true for the last line. counts.characters += line.length() + 1; The last line may not be terminated by a newline character. — Loki Astari, CommentedNov 11, 2024 at 4:17
You should be able to specify multiple files on the command line, so I would expect you to have loop over each file. Also most unix system accept the file name -` as the standard input. — Loki Astari, CommentedNov 11, 2024 at 4:20
In Linux the file name - (dash) can be used. It means use std::cin or std::cout depending on context. But in wc it would represent an input file. — Loki Astari, CommentedNov 11, 2024 at 6:12
You don't need strlen on a std::string; they know their own length (in bytes) so you don't have to waste cycles scanning for the terminating zero. There's a .length() member function, or you could take a pointer to the .cend() element. en.cppreference.com/w/cpp/string/basic_stringstd::string is an explicit-length string, unlike traditional C strings. — Peter Cordes, CommentedNov 11, 2024 at 7:31

Toby Speight · Accepted Answer · 2024-11-11 07:25:13Z

The comments I am making here are to make the code easier to read, and maintain by others. Some of them are set forth in C++ style guides.

Since CLI is not part of the C++ standard headers I would expect the #include to use quotes instead of <>:

#include "CLI/CLI.hpp"

Variable Names

The count_flags names show a possible misunderstanding of the problem requirements. When I look at the programming challenge and when I do a man wc on my Linux system the -c flag (also --bytes) should be counting bytes and not characters. On a system that only uses ASCII (or any other single-byte character set) this will be the same as the -m--chars option, but otherwise there is no guarantee. It would be easier for anyone that maintained this code if one field was referred to as bytes and one field as characters.

Variable Declarations and Initialization

The best practice in C++ is to declare one variable per line with the initialization.

The code above (shown here):

struct Text_Counts { int characters{0}, lines{0}, words{0}; };

should actually be:

struct Text_Counts { int characters{0}; int lines{0}; int words{0}; };

The same is true for:

struct Text_Count_Flags { bool characters{false}; bool lines{false}; bool words{false}; bool multibyte_characters{false}; bool no_flags_set() { return !(characters || lines || words || multibyte_characters); } };

Vertical Spacing

Several of the functions need more vertical spacing to make them more readable. Each separate logic block should have some spacing.

Example

std::string counts_to_string(Text_Counts counts, const Text_Count_Flags &count_flags) { std::string output_string; if (count_flags.lines) { output_string += std::to_string(counts.lines) + " "; } if (count_flags.words) { output_string += std::to_string(counts.words) + " "; } if (count_flags.characters || count_flags.multibyte_characters) { output_string += std::to_string(counts.characters) + " "; } return output_string; }

Function Complexity

The functions main() and get_text_counts() are too complex (do too much). Each of these functions should be broken into smaller functions with only one task.

As programs grow in size, the use of main() should be limited to calling functions that parse the command line, calling functions that set up for processing, calling functions that execute the desired function of the program, and calling functions to clean up after the main portion of the program.

There is also a programming principle called the Single Responsibility Principle that applies here:

that every module, class, or function should have responsibility over a single part of the functionality provided by the software, and that responsibility should be entirely encapsulated by that module, class or function.

Return Values From `main()`

C++ has 2 macros defined for returning from main or other exits from the program: EXIT_SUCCESS and EXIT_FAILURE. These are system dependent and system defined; it is better to use these than return 0;.

Throwing Exceptions

If you are going to throw exceptions for errors, make sure your main() has a try{} catch{} block. Call the code that might throw an exception in the try{} block; the catch{} block should perform any necessary clean up before the program exits.

In the code presented here, it might be better to just return an error and let the main program exit gracefully with return EXIT_FAILURE;.

Choice of <> vs "" for includes is really about search path, i.e. whether the header lives in our own work tree or it comes from an installed library. So I disagree with your reasoning there (but might agree with the recommendation, depending on how the library is installed). — Toby Speight, CommentedNov 11, 2024 at 7:19
Echo what Toby mentioned on search path. If you have installed libraries with some package manager they live in a location where <> will find them. If they are local to only things that you use then "" is more appropriate. So if CLI/CLI.hpp is something you have installed on your system then use <> if it is something you have pulled in locally to the project by some build processes then use "". — Loki Astari, CommentedNov 11, 2024 at 17:43
I also disagree that EXIT_SUCCESS is better. Both 0 and EXIT_SUCCESS indicate successes (note they don't have to be the same value). But I would argue that using 0 is more standard (especially since it is the default when there is no return value in C++) and as a result more consistent if you want it to work with scripts (especially since most scripts will check for zero) so if EXIT_SUCCESS is not zero you are going to get some surprising behaviors. — Loki Astari, CommentedNov 11, 2024 at 17:50
@GaneshTata You can still keep get_text_counts() but internally it should call multiple functions. — pacmaninbw, CommentedNov 11, 2024 at 22:00
@LokiAstari I like self documenting code, and EXIT_SUCCESS and EXIT_FAILURE are clearly self documenting. For beginners starting out return 0 and return 1 may be harder to remember, that was based on what Unix programs should return on exit. — pacmaninbw, CommentedNov 11, 2024 at 22:04

Orlando · Accepted Answer · 2025-01-15 22:34:03Z

If you run your code with the -c flag on the input string "abc\ndef" and "abc\ndef\n", you'll get the same result of 8 both times.

Consider removing one of your ++counts.characters; lines in get_text_counts(), perhaps?

Also, I wouldn't recommend initializing your input_file with static. It works, but if you want to process multiple files in one run, it will never change the input_file since the constructor can only be run once. Consider adding a void method that takes in the input file as a reference and updates the address it points to?

Good point about input_file, but adding a void method is probably also not going to be great. Best would be if it returned something by value. On Linux, it could open /dev/stdin if filepath.empty(). Another overengineerd solution would be to return a std::variant<std::ifstream, std::ref<std::istream>>. Maybe it shouldn't be a function at all; the easiest solution would be to inline get_file_stream() and declare a non-static std::ifstream{} outside the if (filepaht.empty())-statement, and use the .open() method instead of the constructor. — G. Sliepen, CommentedJan 15 at 22:59

Stack Exchange Network

Basic C++ implementation of linux wc command

2 Answers 2

Variable Names

Variable Declarations and Initialization

Vertical Spacing

Function Complexity

Return Values From `main()`

Throwing Exceptions

Hot Network Questions

Basic C++ implementation of linux wc command

2 Answers 2

Variable Names

Variable Declarations and Initialization

Vertical Spacing

Function Complexity

Return Values From main()

Throwing Exceptions

Related

Hot Network Questions

Return Values From `main()`