Markov sentence generator on input file

Question

This is a Markov chain generator. It reads a file, reads all triples of words, and creates the Markov model. The model it makes is weighted, by which I mean that if "word1 word2 word3" occurs more frequently in the input, then it is also more likely to occur in the output.

I think I'm more competent in languages other than C++, which is why I use C++ here. I appreciate any feedback.

#include <unordered_map> #include <vector> #include <iostream> #include <fstream> #include <algorithm> #include <random> #include <cctype> typedef std::pair<std::string, std::string> KeyPair; typedef std::vector<std::string> Words; typedef std::unordered_map<KeyPair, Words> Markov; static std::mt19937 gen; void init_rng() { std::random_device rd; gen.seed(rd()); } std::string random_element(const std::vector<std::string> & v) { std::uniform_int_distribution<> distrib(0, v.size() - 1); int r = distrib(gen); return v[r]; } template<> struct std::hash<KeyPair> { size_t operator()(const KeyPair & p) const { size_t hash1 = std::hash<std::string>{}(p.first); size_t hash2 = std::hash<std::string>{}(p.second); return hash1 ^ (hash2 << 1); } }; Markov make_markov(std::string fname){ std::ifstream infile(fname); if (!infile.is_open()) { throw 1; } Markov markov; std::string word1, word2, word3; infile >> word1 >> word2; while (infile >> word3) { markov[{word1, word2}].push_back(word3); word1 = word2; word2 = word3; } // Avoid empty possibles from end of input. markov[{word1, word2}].push_back(""); markov[{word2, ""}].push_back(""); return markov; } void generate_random_output(Markov & markov, int limit){ std::vector<KeyPair> keys; keys.reserve(markov.size()); for (auto & kp : markov) { keys.push_back(kp.first); } std::uniform_int_distribution<> distrib(0, keys.size() - 1); std::string word1, word2; while (true) { unsigned int r = distrib(gen); KeyPair kp = keys[r]; word1 = kp.first; word2 = kp.second; if (std::isupper(word1[0])) { break; } } std::cout << word1 << " " << word2 << " "; int cur = 2; while (cur < limit) { std::string word3 = random_element(markov[{word1, word2}]); if (word3 == "") { std::cout << ". "; generate_random_output(markov, limit - cur); return; } std::cout << word3 << " "; word1 = word2; word2 = word3; cur++; } } int main() { init_rng(); Markov markov = make_markov("pride_and_prejudice.txt"); for (int i = 0; i < 10; i++) { generate_random_output(markov, 100); std::cout << "\n\n"; } }

indi · Accepted Answer · 2025-01-19 19:58:11Z

Ah, Markov chains. So much fun! Some of my earliest C++ programs used Markov chains to generate text, just like this. Then I’d feed them a funny corpus—like Victorian smut—and see what came out.

#include <unordered_map> #include <vector> #include <iostream> #include <fstream> #include <algorithm> #include <random> #include <cctype>

Because of the nature of #include, it is generally wise to organize your includes in a logical and consistent order.

Remember that #include is just a straight-up textual inclusion; #include <vector> just literally dumps the contents of the file vector in situ. An unfortunate side effect of that is that include order matters. If you do #include <file1>\n#include <file2>, everything in file1 will be visible to file2, which may change the meaning of the stuff in file2. For a well-written include file—as one would assume a standard library include file is—this shouldn’t be true… but better safe than sorry.

That’s why it’s generally wise to, for example, sort your include files by library, then alphabetically.

This won’t be an issue if you’re using modules, of course.

typedef std::pair<std::string, std::string> KeyPair; typedef std::vector<std::string> Words; typedef std::unordered_map<KeyPair, Words> Markov;

None of this is incorrect, but the more modern way to make type aliases is with using. For example: using KeyPair = std::pair<std::string, std::string>;.

using has three benefits. First, and most importantly, it can do things that typedef simply cannot. So if you use typedef, then sometimes you will have to use using anyway. Consistency is better.

Second, using is much clearer, especially for complicated aliases.

And third, there is a general trend in modern C++ toward constructs that look like operation name = whatever;, rather than the hodge-podge of forms that “classic” constructs sport. Using using not only creates more consistency among type aliases, it makes for more consistency with the rest of the language.

static std::mt19937 gen;

First, if you want to make a translation-local object, function, class, or whatever else, the better way to do is to use an anonymous namespace:

namespace { std::mt19937 gen; } // anonymous namepace

This has numerous benefits, not least that you can make anything translation-local, not just functions and objects.

But more importantly, I’m sure you’ve at least heard that global variables are a code smell. Everybody seems to have heard that global variables are “bad”, but very few seem to have heard why global variables are “bad”.

There are actually a bunch of reasons why global variables are bad, but the one I want to focus on here is testability. Global variables make testing code extremely difficult, and code that is not tested is garbage code. By “garbage code” I don’t mean that the code is necessarily bad; I mean that it is practically useless. I could write the most amazing, most powerful, most feature-full, most efficient library of all time… but then when someone writing a real-world project wants to use it and asks me “does it work?”, if I answer “maybe? LOL!”… I think you can see why my library won’t be of any use to them.

The number one most important skill in programming—for any programming language—is knowing how to properly test your code. If you can’t do that, you’re as useless a programmer as that library I mentioned above.

I have been harping on the importance of testing for years, and for a long time people argued the point or brushed it off… but that’s changing now. Even just in the last 2–3 years, I have seen a massive shift in attitudes toward the importance of properly testing code. This is especially true in C++, because C++ very often powers mission-critical—and often life-critical—systems. Just take a look at the conference talks for the last 5 or more years, and you will see a dramatic spike in the number of testing talks recently. Testing is important, and people are really cluing into that fact now.

Now, you can argue: “Okay, sure testing is important for professional or public projects… but this is just my own private learning project! I don’t need to test it!” Sure, I mean, honestly, let’s be real: I don’t actually test my play projects either. BUT… testing is about more than just the testing. You also have to write code that is test-able. Writing code that can be tested is as important—if not more so—than writing the actual tests. Being a good C++ programmer—being a good programmer in any language—means learning how to write code that (at least) can be tested (even if it’s not actually tested, which, admittedly, it doesn’t need to be for learning or play projects).

Global variables generally make testing code difficult, if not impossible. That is (one of the many reasons) why they are bad.

But let’s be concrete, not theoretical. Why does this global variable create a problem? Well, let’s consider how it is used. gen is ultimately used in two places (which, technically only needs to be one place, but we’ll get to that later): to randomly select a key, and then to randomly select a word from that key. Let’s focus on just the random_element() function (foreshadowing!). Here’s the literally multi-trillion dollar software-industry question: “does it work?”

Well? Does it?

You’ve probably done some eyeball testing and it looks like works, right? But how can you be sure the results are actually random? Or, the flip side: if you used the same random sequence, how can you be sure it will reproduce the same output? How do you know it can select any string from the string vector? What if it only selects even-numbered strings? What if it only selects the first ~100 strings and never any higher? Here’s an actual real concern: what if you’ve screwed up the bounds and it never selects the last string? … or if it might select one-past-the-end?

Basically: does it work?

You don’t know. You can’t know, because you haven’t tested it. Which, for a personal play project is fine… but if you ever wanted to reuse any of this in a real project—and a random_element() function could be very useful in real projects—you want it to be at least test-able, so that if or when you ever need to test it, you can.

But you can’t test this function. Not properly anyway.

How do we fix that?

Well, the solution is to make the function “pure”… or at least as pure as possible. Its output should only rely on its inputs. That means we need to make the random generator one of its inputs:

auto random_element(std::vector<std::string> const& v, std::mt19337& gen) -> std::string;

Even better would be to make the random generator a template, so we could use any type of random generator:

template<std:: uniform_random_bit_generator Gen> auto random_element(std::vector<std::string> const& v, Gen& gen) -> std::string;

Now we can ditch the global variable, and modify init_rng() to return the initialized random generator:

auto init_rng() { return std::mt19337{std::random_device{}()}; }

… and now, when/if you ever need it, random_element() is trivially testable.

Now, speaking of random_element()… right now it is hard-coded to only pick random strings out of a vector<string>. We can generalize this.

First, instead of hard-coding the source to be a vector of strings, let’s make it a template that can take anything. How do we do that? Well, let’s start by generalizing the operations.

So this is where we’re starting from:

std::string random_element(const std::vector<std::string> & v) { std::uniform_int_distribution<> distrib(0, v.size() - 1); int r = distrib(gen); return v[r]; }

The first non-general operation there is v.size(). The generalized form of that is: std::ranges::size(v). That was easy.

The next operation is v[r]. Now, unfortunately, there is no direct generalized form of v[r]. So what we have to do is first get the begin iterator to v, and then advance it, and then dereference it:

std::string random_element(const std::vector<std::string> & v) { std::uniform_int_distribution<> distrib(0, std::ranges::size(v) - 1); int r = distrib(gen); auto i = std::ranges::begin(v); std::ranges::advance(i, r); // or, if you want bounds-checking: std::ranges::advance(i, r, std::ranges::end(v)); return *i; }

It may look like more code, but that’s what v[i] is doing under the hood anyway, so it doesn’t actually change the efficiency.

And what it gains us is now we can use any type of input range:

template<std::ranges::forward_range R> std::string random_element(R const& v) { std::uniform_int_distribution<> distrib(0, std::ranges::size(v) - 1); int r = distrib(gen); auto i = std::ranges::begin(v); std::ranges::advance(i, r); // or, if you want bounds-checking: std::ranges::advance(i, r, std::ranges::end(v)); return *i; }

Now this function will pick a random element out of a vector<string> a list<string> or whatever. And it doesn’t even need to be a string anymore! You could pick a random int out of std::array<int, N>, for example, or pick a random map key with random_element(some_map | std::views::keys); (more foreshadowing!).

While we’re at it, let’s add the RNG parameter:

template<std::ranges::forward_range R, std:: uniform_random_bit_generator Gen> std::string random_element(R const& v, Gen& gen) { std::uniform_int_distribution<> distrib(0, std::ranges::size(v) - 1); int r = distrib(gen); auto i = std::ranges::begin(v); std::ranges::advance(i, r); // or, if you want bounds-checking: std::ranges::advance(i, r, std::ranges::end(v)); return *i; }

There are also a couple of bugs that we need to fix.

Let’s start with the easy one: what if the input sequence is empty? Right now, you’ll just get a crash, because size(v) will return zero, then you’ll subtract one and it will wrap around into some huge number, and then you’ll try to dereference v[some huge number], and inevitably get a fault.

Easy fix: just do something like if (std::ranges::empty(v)) throw std::invalid_argument{"empty input range"};.

The second bug is a bit more nefarious. The problem arises in two lines:

 std::uniform_int_distribution<> distrib(0, v.size() - 1); int r = distrib(gen);

Here we are assuming that the size is an int, or at least that an int is “good enough” to hold the size. That is not a great assumption. A vector’s size is stored in its size_type, which is guaranteed to be able to hold larger values than an int. You’ll probably never notice the problem, because you’ll probably never have a vector that large… but you might!

The correct type to use for a vector<string> would be vector<string>::size_type or vector<string>::difference_type. But the generic form would be std::ranges::range_size_t<R> or std::ranges::range_difference_t<R>.

Choosing whether to use the size type or the difference type is extremely complicated, but to sum it up very briefly: if you just want to know the size, use the size type… but if you intend to do maths on the size, use the difference type. In this case, you are planning to do maths, so you should probably choose the difference type.

So to fix the first line, you would need to do something like:

using size_type = std::ranges::range_difference_t<R>; auto distrib = std::uniform_int_distribution{size_type{}, size_type(std::ranges::size(v) - 1)};

But while we’re at it, since we’re already getting the size, we might as well use it to solve that first bug:

using size_type = std::ranges::range_difference_t<R>; auto const size = size_type(std::ranges::size(v)); // or auto const size = std::ranges::distance(v); if (size == 0) throw std::invalid_argument{"empty input range"}; auto distrib = std::uniform_int_distribution{size_type{}, size_type(size - 1)};

You may be wondering why I sometimes use size_type{} and sometimes size_type(), and why sometimes 0 and sometimes size_type{}. Well, here is the heuristic:

My concern is always about unwanted or unexpected narrowing or promotion. Because we don’t know exactly what type size_type is, we don’t know if it is higher than int on the promotion chain, or lower (or neither). So:
1. we don’t know if size_type{1} - 1 will be a size_type or an int (or potentially something else!); and
2. we don’t know if an int can implicitly convert to a size_type.
Whenever we have a value with a type that we don’t know can implicitly convert to a size_type, we don’t know if size_type x = value; would compile without warnings. We also don’t know if size_type{value} would compile, because the braces protect against narrowing… and we don’t know if narrowing would happen. Thus, whenever we are converting from something that might not be a size_type, we need to use size_type(value), which forces the conversion, and disables narrowing checks.
Thus, because we don’t know what type size - 1 will be, we need to do size_type(size - 1), and not just size - 1 or size_type{size - 1}. We know, logically, that this will always be safe (after we’ve confirmed that size isn’t zero). But the compiler can’t necessarily know that.
size{} will always be zero. size{0} will also always be zero, and always safe. But I find that size_type{} better expresses that I want the zero value of a size_type, rather than saying “I want to convert0 to a size_type”.
I could do if (size == size_type{}), and that would be technically more correct… but I really don’t care if or whether either side of that equals gets promoted. It will be testing for zero in any case. So if (size == 0) is fine, and I think it’s clearer.
The type of 0 is int. So if I did uniform_int_distribution{0, size_type(size - 1)}, the deduction will get confused because the first argument type is int while the second is size_type. I do uniform_int_distribution{size_type{}, size_type(size - 1)} so the deduction is clear: this will be a uniform_int_distribution<size_type>. And no, I can’t just do uniform_int_distribution<size_type>{0, size - 1}, because int → size_type might not be an implicit conversion; it might require an explicit cast, so I have to do either uniform_int_distribution<size_type>{size_type{0}, size_type(size - 1)} or uniform_int_distribution<size_type>{size_type{}, size_type(size - 1)} anyway… might as well minimize the repetition as much as possible, and just let the deduction happen.

As you can see, implicit promotions are a pain in the ass in generic code. But if you’re diligent with your casting, they’re manageable.

Anywho, now the fix for the second line is trivial: just use auto:

auto r = distrib(gen);

Speaking of auto, that should also be the return type, so you can find more than just strings:

template<std::ranges::forward_range R, std:: uniform_random_bit_generator Gen> auto random_element(R const& v, Gen& gen) { using size_type = std::ranges::range_difference_t<R>; auto const size = std::ranges::distance(v); if (size == 0) throw std::invalid_argument{"empty input range"}; auto distrib = std::uniform_int_distribution{size_type{}, size_type(size - 1)}; auto const r = distrib(gen); auto i = std::ranges::begin(v); std::ranges::advance(i, r); return *i; }

Now, adding all this flexibility to this function might seem like unnecessary effort since you’re only using it in one place… but… are you only using it in one place?

Consider that now that this function can pick any kind of value out of any kind of sequence. Now consider this first bit of generate_random_output():

void generate_random_output(Markov & markov, int limit){ std::vector<KeyPair> keys; keys.reserve(markov.size()); for (auto & kp : markov) { keys.push_back(kp.first); } std::uniform_int_distribution<> distrib(0, keys.size() - 1); std::string word1, word2; while (true) { unsigned int r = distrib(gen); KeyPair kp = keys[r]; word1 = kp.first; word2 = kp.second; if (std::isupper(word1[0])) { break; } } // ...

So, what you’re doing there is creating a whole vector that is just a copy of all the keys in markov, then picking a random element out of that vector. Hello! Picking a random element out of a vector? We have that technology!

void generate_random_output(Markov& markov, int limit) { std::vector<KeyPair> keys; keys.reserve(markov.size()); for (auto & kp : markov) { keys.push_back(kp.first); } // Don't need this anymore: //std::uniform_int_distribution<> distrib(0, keys.size() - 1); std::string word1, word2; while (true) { //unsigned int r = distrib(gen); //KeyPair kp = keys[r]; auto kp = random_element(keys, gen); // <--- word1 = kp.first; word2 = kp.second; if (std::isupper(word1[0])) { break; } } // ...

But wait! he says, in his salesperson voice. There’s more! random_element() now not only returns any type… it takes takes any type of range as input. That means it doesn’t need a vector. It can work just fine with a view. So all we need is a view of the keys in markov, and we can avoid the expense of creating an entire vector:

void generate_random_output(Markov& markov, int limit) { // All of this is now unnecessary: //std::vector<KeyPair> keys; //keys.reserve(markov.size()); //for (auto & kp : markov) //{ // keys.push_back(kp.first); //} std::string word1, word2; while (true) { auto kp = random_element(markov | std::views::keys, gen); // <--- word1 = kp.first; word2 = kp.second; if (std::isupper(word1[0])) { break; } } // ...

And there is actually a way to optimize this still further. The key is: instead of returning the element from random_element()… return an iterator instead. If you just return a random iterator from a collection, then you don’t need to copy anything. You could get a random iterator, check the keys to see if it starts a sentence, and pick a random word out of its value, and print it all without a single string copied. I’ll leave that as an exercise for the reader.

template<> struct std::hash<KeyPair> { size_t operator()(const KeyPair & p) const { size_t hash1 = std::hash<std::string>{}(p.first); size_t hash2 = std::hash<std::string>{}(p.second); return hash1 ^ (hash2 << 1); } };

This is all technically correct, except for one detail: it is std::size_t, not size_t. Some compilers and standard libraries have traditionally allowed it unqualified, but that is very likely to change in the near future.

Note that the problem disappears if you use auto, which is one of the many reasons you should:

template<> struct std::hash<KeyPair> { auto operator()(const KeyPair & p) const { auto hash1 = std::hash<std::string>{}(p.first); auto hash2 = std::hash<std::string>{}(p.second); return hash1 ^ (hash2 << 1); } };

Also, in C++, we put type modifiers with the type. In other words:

const KeyPair &p: this is C style
const KeyPair& p: this is C++ style
KeyPair const& p: also C++ style
const KeyPair & p: neither style, and too easily confused for a binary operator

Markov make_markov(std::string fname){

For maximum flexibility, you should take a stream, not a file name. That way you could use any kind of input stream: a file stream, a string stream, a network stream… anything. You could keep the simplicity by simply using overloads:

Markov make_markov(std::istream& infile) { // ... } Markov make_markov(std::filesystem::path const& fname) { auto infile = std::ifstream{fname}; if (not infile.is_open()) throw 1; // should probably throw a proper exception here return make_markov(infile); }

Also, note that if you are dealing with file names/paths, you should really be using std::filesystem::path, not just std::string. There are a lot of technical reasons why, but I’ll summarize with 1) it encodes the fact that it is a file name/path, and not just a random string; and 2) std::filesystem::path handles subtle filesystem character encoding issues.

To make this function properly testable, you should also take the random generator as a parameter.

 std::string word1, word2, word3; infile >> word1 >> word2;

You should really check that there were no errors at this point. Your system assumes at least two words were successfully scanned. You might want to verify that that’s the case.

 while (infile >> word3) { markov[{word1, word2}].push_back(word3); word1 = word2; word2 = word3; }

This could be optimized, and probably significantly so, with carefuly application of std::move(). But it would be tricky.

 markov[{word1, word2}].push_back(""); markov[{word2, ""}].push_back("");

Couple things here.

First, you should always prefer emplace_*() to push_*(). It makes little difference here, but it can make a lot of difference in other situations.

Second, whenever you want an empty string, you should never use "". You should always use std::string{}. Why? Because std::string{} is direct, clear, and no-fail, whereas std::string{""} or std::string s = "" or whatever else requires an extra step, parsing an (empty) string literal, and is not no-fail.

For related reasons, it is better to do word.empty() rather than word == "", or even word == std::string{}.

But again, you should be using emplace_back(), not push_back(), so that makes things even simpler:

 markov[{std::move(word1), word2}].emplace_back(); markov[{std::move(word2), std::string{}}].emplace_back();

On to generate_random_output():

void generate_random_output(Markov & markov, int limit){

To make this function testable, you should add the output stream as a parameter. Even without considering testing, you may want to output to something other than cout. You may want to write to a file, or to a memory stream for display in a GUI, or further modification. (Interesting how making functions more testable automatically makes them more powerful, eh?)

 word1 = word2; word2 = word3;

As in the previous case, you can probably use some moves here to get a quick-and-easy efficiency boost.

That’s about it for the review! If you’d like a suggestion or two for some future extensions/improvements:

I’d suggest making a proper, bespoke Markov type, rather than just aliasing unordered_map<pair<string, string>, vector<string>>. There are a couple of reasons for this. For starters, there is the possibility of much more efficient storage: rather than a triple getting more weight by simply repeating the last word, you could store each word with a count.

With that, I would add the ability to store the data in an optimized form. That way you could avoid reparsing the same corpus over and over; just parse it once and save the result, then reuse it to generate over and over. Exactly what that optimized form would be is up to you, but I would suggest considering making an enumerated list of the strings, then just using their indices and probabilities. So if “truth” has index 100, “universally” has index 999, and “acknowledged” has index 12 and “believed” has index 64, one record in the data could be “100 999 0.9 12 0.1 64”. Even better, encode that as binary data.

And then, of course, you could add more sophisticated parsing, to take into account the difference between a period at the end of a sentence versus the period at the end of “Mr.”, or sentence, clause, and/or paragraph terminators. And maybe make the depth adjustable.

Happy coding!

\$\begingroup\$Thank you, I learned a lot from this answer.\$\endgroup\$
– davidlowryduda
CommentedJan 20 at 14:58 — davidlowryduda, CommentedJan 20 at 14:58

Toby Speight · Accepted Answer · 2025-01-18 18:46:36Z

std::size_t is consistently misspelled. Your standard library is evidently forgiving of this, but it's not something you should depend on in portable code.

 if (std::isupper(word1[0])) { break; }

Note that the <cctype> functions require a positive integer argument (or EOF), but here we're passing a possibly-negative char value.

 std::string word1, word2, word3; infile >> word1 >> word2;

We should test whether this >> was successful before we use word1 and word2.

Stack Exchange Network

Markov sentence generator on input file

2 Answers 2

Hot Network Questions

Markov sentence generator on input file

2 Answers 2

Related

Hot Network Questions