C++ huffman encoding - constructing encoded bytes efficiently

Question

I've written a huffman compression program that works as intended, however in order to intuitively lay out the processes required to write the compressed file, I used strings in critical functions that I suspect are causing a performance hit. I came here to get ideas from the community on ways that I can go about this with a more efficient approach.

My program reads the input, creates the huffman tree and creates a corresponding dictionary holding the codes for each byte which has the following struct:

struct dictentry { int byte; string enc; size_t depth; dictentry(int byte, string enc) { this->byte = byte; this->enc = enc; this->depth = enc.length(); } }; class HuffDict { vector<dictentry> dict; string b2enc(unsigned char byte); ...... };

So in the end each byte occuring in the input file is assigned a string with it's code for example: 54 - '001', 55 - '010' and so on.

Now in order to create the actual bytes that will be written in the output file, I have created a function that looks in the dictionary and returns the code for a byte argument. VS profiler listed this function highest in execution time:

string HuffDict::b2enc(unsigned char byte) { for (vector<dictentry>::iterator it = dict.begin(); it != dict.end(); ++it) { if ((*it).byte == byte) { return (*it).enc; } } return ""; }

In order to put together the actual bytes to write I am first creating a string of 0-1s by concatenating string codes and then read them in chunks of 8, where I use bitwise operations to eventually convert the string representation of a byte to a number. The following is a class method which does this exact thing and ENC_BUFFER is a constant number which I 've set to 16kB.

int HuffEncoder::encode_chunk() { int bitstreampos = 0; //position in the bitstream unsigned char byte = 0; //the byte to be added to the bitstream string str2enc = ""; //string of at least 8 bits to encode while ((bitstreampos <= ENC_BUFFER) && (!this->done)) { byte = 0; str2enc = ""; str2enc += this->strleftover; while (str2enc.length() < 8) { str2enc += (*dict).b2enc(buffer[this->pos]); this->pos++; if (this->pos >= this->len) { this->done = true; break; } } this->leftover = str2enc.length() - 8; if (this->leftover > 0) { this->strleftover = str2enc.substr(8, 8 + this->leftover); str2enc = str2enc.substr(0, 8); } else if (this->leftover < 0) { this->lastbits = str2enc.length(); } else { this->strleftover = ""; } //bitwise operations to construct byte from str2enc for (string::size_type i = 0; i < str2enc.size(); ++i) { if (str2enc[i] == '1') { byte += 1 << (7 - i); } } this->bitstream[bitstreampos] = byte; bitstreampos++; } return bitstreampos; //bitstreampos actually holds the byte size of the chunk }

It's declaration is a bit hectic but I'll include it in case it helps:

class HuffEncoder { HuffDict *dict; char *buffer; int len; char bitstream[ENC_BUFFER]; //the resulting bitstream after encoding int pos = 0; //current position in buffer int leftover = 0; //how many bits were left over from previous bitstream string strleftover = ""; bool done = false; public: unsigned char lastbits = 0; HuffEncoder(char* buffer, int len, HuffDict* dict) { this->buffer = buffer; this->len = len; this->dict = dict; } int encode_chunk(); //encodes a chunk of the buffer into the bitstream and returns the size of bitstream bool isdone() { return this->done; } char* get_bitstream_ptr() { return bitstream; }; };

So to wrap it all, my question is: is there a different approach I can use to create the resulting encoded bytes (probably with no string involvement)? Or any other kind of improvement that you can think of?

Did you think of using an integer and swapping bits, to encode the string ? — TDk, CommentedDec 17, 2017 at 14:31
@TheophileDano you mean store the byte codes as integers and according to their depth shift them until I reach 8 bits? — Tsaras, CommentedDec 17, 2017 at 17:39
see Austin's reply for what I actually meant. He introduced it better :) — TDk, CommentedDec 18, 2017 at 9:00

Oscar Smith · Accepted Answer · 2017-12-17 17:44:48Z

First let me say, congratulations for writing code that works before trying to make it fast.

That said, you'll find that almost all of your code will go away: huffman just isn't that hard to write.

Let's start with the function you identified as taking most of the time:

string HuffDict::b2enc(unsigned char byte) { for (vector<dictentry>::iterator it = dict.begin(); it != dict.end(); ++it) { if ((*it).byte == byte) { return (*it).enc; } } return ""; }

This is great style for traversing a vector of unknown data in unknown order looking for a match. But this isn't a vector of unknown data, is it? This is a vector of data keyed by a byte. What's a byte?

When you have an enumerable range that you need to use as a lookup, just use indexing! A byte is one of 256 values, of which you will likely be using less than 100. Instead of randomly storing values in your vector, order them by the byte value and then look them up:

string HuffDict::b2enc(unsigned char byte) { return dict_ordered_by_byte_value[byte]; }

In fact, you can simply build a table and store the table in your object, then you won't have to insert this function call at all. Think of the time the compiler will save by not working to inline it!

At some point, you will come to realize that your string based representation is "a little" (horribly) inefficient. At that point, I encourage you to adopt a string-like structure using integers: use one integer to store the length of the bit-string, and another integer to contain the actual bits. Integers are long enough that you can replace most of your existing code piecemeal and not worry about a string that is too long.

First of all, thanks. The dictionary lookup thing makes a lot of sense, as for the last part about representing with integers: you mean I should swap my string dictionary codes with corresponding ints and shift them in place using another int counting the length? — Tsaras, CommentedDec 17, 2017 at 17:52
Yes. I'm not sure what your time constraints are - either coding or running time. But you can get some speed using a std::bitset, or get all the speed using the bitwise instructions directly, at the cost of more coding time. — aghast, CommentedDec 17, 2017 at 18:37

Tsaras · Accepted Answer · 2017-12-18 14:45:09Z

I am posting my revised code based on suggestions in this thread: Added integer representation of dictionary codes for each byte, switched to indexing lookup and transformed the strings-dependent encode function to use the integer values and bit shifting. Result ~10 times faster execution.

Here is the new encode function:

int HuffEncoder::encode_chunk_ex() { int bitstreampos = 0; //position in the bitstream unsigned char encoded_byte = 0; //the byte to be added to the bitstream unsigned int curr_byte = this->leftover_byte; //byte forming by shifting bitcodes into place unsigned char bitlen = this->leftover_bits; //bit length of current byte unsigned char bitcodelen = 0; //length of bitcode for current code while ((bitstreampos <= ENC_BUFFER) && (!this->done)) { //first create an int holding more than 8 bits of encoded stream while (bitlen < 8) { bitcodelen = this->codelen[buffer[this->pos]]; curr_byte = curr_byte << bitcodelen; curr_byte += this->code[buffer[this->pos]]; this->pos++; bitlen += bitcodelen; if (this->pos >= this->len) { this->done = true; break; } } //second break the int into full 8-bit sequences and keep the remainder while (bitlen >= 8) { encoded_byte = curr_byte >> bitlen - 8; bitlen -= 8; curr_byte = (curr_byte << sizeof(curr_byte) * 8 - bitlen) >> sizeof(curr_byte) * 8 - bitlen; this->bitstream[bitstreampos] = encoded_byte; bitstreampos++; } //done flag means the last byte of buffer has been processed if ((this->done) && (bitlen > 0)) { encoded_byte = curr_byte << (8 - bitlen); this->bitstream[bitstreampos] = encoded_byte; bitstreampos++; this->lastbits = bitlen; } } this->leftover_bits = bitlen; if (!bitlen) { this->leftover_byte = 0; } else { this->leftover_byte = curr_byte; } return bitstreampos; //bitstreampos actually holds the byte size of the chunk }

The HuffEncoder class now has arrays containing the bitcodes and their depth for direct access from the method:

class HuffEncoder { HuffDict* dict; int code[256]; //int array holding bitcodes for faster access unsigned char codelen[256]; //char array holding bitcode length for faster access unsigned char *buffer; int len; char bitstream[ENC_BUFFER]; //the resulting bitstream after encoding int pos = 0; //current position in buffer int leftover_bits = 0; //how many bits were left over from previous chunk unsigned int leftover_byte = 0; //byte holding the leftover bits bool done = false; public: unsigned char lastbits = 0; HuffEncoder(unsigned char* buffer, int len, HuffDict* dict); int encode_chunk_ex(); bool isdone() { return this->done; } char* get_bitstream_ptr() { return bitstream; }; };

The encoding is done in chunks of size ENC_BUFFER using the following loop inside the main function:

HuffDict huffdict{ &tree }; HuffEncoder encoder{ buffer, length, &huffdict }; while (!encoder.isdone()) { int size = encoder.encode_chunk_ex(); out.write(encoder.get_bitstream_ptr(), size); }

Congratulations on the speedup. Question: What happens when the output buffer is not long enough? It seems like you silently return. — aghast, CommentedDec 18, 2017 at 14:33
It loops in chunks until it reaches the end of input. I included the part from my main function which does this. — Tsaras, CommentedDec 18, 2017 at 14:46

Stack Exchange Network

C++ huffman encoding - constructing encoded bytes efficiently

2 Answers 2

Hot Network Questions

C++ huffman encoding - constructing encoded bytes efficiently

2 Answers 2

Related

Hot Network Questions