
I am looking for text compression algorithms (natural language compression, rather than compression of arbitrary binary data).

I have seen, for example, An Efficient Compression Code for Text Databases. This algorithm basically uses words as symbols, creates a dictionary from them, and replaces them with integers. So something like this:

Hello I am a sentence. And hello I would be another sentence. 

Breaking it into a dictionary would produce:

0: [space]
1: Hello
2: I
3: am
4: a
5: sentence
6: .
7: And
8: hello
9: would
a: be
b: another

Then the text is turned into:

1 0 2 0 3 0 4 0 5 6 0 7 0 8 0 2 0 9 0 a 0 b 0 5 6 
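
For illustration, here is a toy sketch of that substitution in Python (my own code, not the paper's algorithm; the numbering simply follows the order in which tokens first appear):

    import re

    # Toy word/separator dictionary encoder: every distinct token (word, whitespace
    # run, or punctuation mark) gets an integer. A sketch only, not the paper's scheme.
    def encode(text):
        tokens = re.findall(r"\w+|\s+|[^\w\s]", text)
        symbols, codes = {}, []
        for tok in tokens:
            key = tok if tok.strip() else " "      # treat any whitespace run as one symbol
            if key not in symbols:
                symbols[key] = len(symbols)
            codes.append(symbols[key])
        return codes, symbols

    codes, symbols = encode("Hello I am a sentence. And hello I would be another sentence.")
    print(symbols)   # {'Hello': 0, ' ': 1, 'I': 2, 'am': 3, ...}
    print(codes)     # the integer stream that stands in for the text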

Another compression approach, which sounds like it works better for numbers, is LZ77 or LZ78. While I don't fully understand how to implement them, I get the gist of it: they compare a small sliding window of characters to find repeated sequences of individual numbers/characters/bytes, and put those in a dictionary.
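
To check that I have the gist right, here is roughly how I picture the matching step (a simplified sketch of my own: it only reports (position, offset, length) matches and ignores how real LZ77/LZ78 encode literals, flags and dictionaries):

    # Simplified sliding-window matcher: at each position, look back up to
    # `window` characters for the longest earlier copy of the upcoming text.
    # Real LZ77 also emits literals and allows overlapping matches; this doesn't.
    def find_matches(data, window=4096, min_len=3):
        i, matches = 0, []
        while i < len(data):
            best_len, best_off = 0, 0
            for j in range(max(0, i - window), i):
                length = 0
                while (i + length < len(data)
                       and j + length < i
                       and data[j + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_off = length, i - j
            if best_len >= min_len:
                matches.append((i, best_off, best_len))
                i += best_len
            else:
                i += 1
        return matches

    # [(28, 27, 7), (51, 39, 10)] -> the repeated "ello I " and " sentence."
    print(find_matches("Hello I am a sentence. And hello I would be another sentence."))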

So for our text example above, we can find a few sequences that repeat a few times:

en (in sentence)
ello I (in [Hh]ello I)

Not a particularly good example of repeated sequences, but we could create a small dictionary from this:

#: ello I @: en 

(choosing characters that are outside the ASCII letter range, for example).

Then we can shorten it a bit:

Hello I am a sentence. And hello I would be another sentence.
H# am a s@t@ce. And h# would be another s@t@ce.

(Probably we've actually added bytes to the data in this case, because of the dictionary and the lack of good sequences, which is why this type of encoding isn't necessarily best on natural language text.)
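
Here is a quick toy check of that (my own code; the dictionary cost below is just a naive "replacement + sequence" listing, not a real format):

    # Toy size check for the hand-made substitution above: apply the two
    # replacements and compare lengths, with and without a naive dictionary
    # (each entry counted as replacement char + sequence + 2 bytes of overhead).
    text = "Hello I am a sentence. And hello I would be another sentence."
    substitutions = {"ello I": "#", "en": "@"}

    shortened = text
    for seq, repl in substitutions.items():
        shortened = shortened.replace(seq, repl)

    dictionary_cost = sum(len(seq) + len(repl) + 2 for seq, repl in substitutions.items())
    # 61 47 61 -> shorter on its own, but only break-even once the dictionary is shipped
    print(len(text), len(shortened), len(shortened) + dictionary_cost)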

Compare that to what we had before with the word/number technique (not taking the dictionary into account):

Hello I am a sentence. And hello I would be another sentence.
H# am a s@t@ce. And h# would be another s@t@ce.
1020304056070802090a0b056

If we had a larger volume of text, there would be more copies of the same words, so it would shorten more.

What I'm wondering is whether there is a way to automatically find the best encoding for the bytes, i.e. automatically find all the sequences that could go into a dictionary. I don't see how that could be impossible, but I imagine it is, otherwise it would have been done already. It seems like the kind of problem that would be best solved with techniques from DNA sequence analysis.

In DNA sequence analysis, I would guess there are techniques for finding "longest sequence matches" and the like. Basically, it seems like it would work as follows, given the string ATGCCATTCGCCGTTA (a brute-force sketch follows the list):

  1. Compare every 2-letter group with each other: AT, TG, GC, etc.
  2. Compare every 3-letter group: ATG, TGC, GCC, etc.
  3. Compare every 4-letter group, 5-letter group, and so on.
  4. Stop growing a sequence once there is only one copy of it in the entire text.
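
Something like this toy Python is what I have in mind (purely a brute-force sketch of my own idea, not an existing algorithm):

    from collections import Counter

    # Brute-force version of the idea above: count every substring of length
    # 2, 3, 4, ... and keep the ones that occur more than once; stop as soon
    # as some length has no repeats at all.
    def repeated_substrings(text, min_len=2):
        repeats = {}
        length = min_len
        while True:
            counts = Counter(text[i:i + length] for i in range(len(text) - length + 1))
            found = {s: c for s, c in counts.items() if c > 1}
            if not found:
                break
            repeats.update(found)
            length += 1
        return repeats

    print(repeated_substrings("ATGCCATTCGCCGTTA"))
    # on the hello text this finds 'en', 'ello I' and ' sentence.' among others
    print(repeated_substrings("Hello I am a sentence. And hello I would be another sentence."))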

But it's hard to imagine the properties of this approach. I'm not sure whether it would be extremely slow, or how much computation it would require.

But using this approach on the hello string above, it seems like it would eventually discover the same patterns, en and ello I.

In this sort of way, it seems possible to automatically find all repeated sequences in the text and derive the best encoding from them. But I'm not sure if this is true, and I'm not sure why I haven't seen anything like it in the wild. Can someone explain the reason for this, and/or explain why it's not possible to automatically find all sequences in the text to use for encoding?

  • With a pre-defined dictionary, it is virtually impossible to get an optimal encoding for any given arbitrary text. On the other hand, if we use a dynamically defined, optimal dictionary for a particular given text, then we ought to consider the size of the dictionary itself and add it to the size of the compressed result, since now both the custom dictionary and the encoded text must be shared to be meaningful.
    – Erik Eidt
    Commented Aug 23, 2018 at 4:51
  • Actually, in the first case the dictionary needs to be part of the output, so the text might not have gotten shorter.
    – Euphoric
    Commented Aug 23, 2018 at 5:09
  • Dictionary-based compression is a huge part of existing compression methods. In brief, yes, it's possible to find a fitting dictionary automatically; no, it's not feasible to find the optimal one. This is why advances in compression tend to be of the type "We achieve a size reduction of 3% using 200% more computation".
    Commented Aug 23, 2018 at 6:06
  • Could you please clarify, because I got lost at the end: are you trying to optimize the size of the sliding windows in LZxx? Or are you trying to create your own substitution-based algorithm based on blocks of chars instead of words? Finally, what is unsatisfactory in the first algorithm (the key of it being not only to substitute words by integers, but also to use an optimal variable-length encoding of these integers and to reduce the size of the dictionary data)?
    Commented Aug 23, 2018 at 7:08
  • @Christophe Yes, I am wondering about a "substitution-based algorithm based on blocks of chars instead of words". The DNA sequencing example suggests there are ways to search for sequences/patterns in a structured way. I don't understand the first paper I linked to yet, so I'm not sure if it is satisfactory. I just wanted to know whether there is an "ideal" solution to this problem, and why a char-based substitution can't be ideal.
    Commented Aug 23, 2018 at 7:11

3 Answers


Text compression algorithms

Word based compression

You already explained the basic principles of a dictionary compression that works word by word. However, the interesting article that you referred to is about optimizing this process.

With your example of 62 characters of original text broken down into 25 symbols (counting the spaces), you found that the compressed version was a little (about 20%) shorter than the original:

 Hello I am a sentence. And hello I would be another sentence.
 1 0 2 0 3 0 4 0 5 6 0 7 0 8 0 2 0 9 0 a 0 b 0 5 6

This simplified simulation, however, misses two things: the size of the numbers and their separation. Now imagine that your sentence is only an extract from a large text corpus containing thousands of unique words. Your compressed version might then very well look like:

 1001 0 200 0 30 0 4307 0 5277 638 0 7999 0 8034 0 200 0 901 0 4096 0 6348 0 5277 638 

Oops! It's 35% larger than the original! It gets worse if you get rid of the separator and use a fixed number of digits for each word (61% more than the original; that's expansion, not compression ;-) ):

 1001000002000000003000004307000052770638000079990000803400000200000009010000409600006348000052770638 

Of course, the first thing we think of is to store each number in binary. Two bytes are sufficient to store any number between 0 and 65535. For 25 word indexes, this means 50 bytes, so we're back to the original compression of around 20%. But what if you index even larger bodies of text with hundreds of thousands of potential words? Then you will need larger binary numbers of 4 bytes each, or even 8 bytes, and again your compression would not be very efficient.

What the algorithm proposes, without going into details, is to use a variable-length encoding of the numbers and to make sure that the most frequent words get a short number, while the very rare words can afford to get a longer one. They do this by sorting the words by frequency and assigning the numbering after the sort. They also use some bit-level tricks to handle the variable lengths more efficiently.
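
As a rough illustration of the principle only (my own simplified byte-oriented variant, not the exact code from the paper):

    from collections import Counter

    # Frequency-ranked, variable-length codes (simplified illustration, not the
    # paper's exact scheme): rank words by frequency, then encode rank r as a
    # big-endian varint where the high bit of every byte but the last means
    # "more bytes follow". The ~128 most frequent words cost 1 byte, the next
    # ~16k cost 2 bytes, and so on.
    def rank_by_frequency(words):
        return {w: r for r, (w, _) in enumerate(Counter(words).most_common())}

    def variable_length_code(rank):
        out = bytearray()
        while True:
            out.insert(0, rank & 0x7F)
            rank >>= 7
            if rank == 0:
                break
        for i in range(len(out) - 1):
            out[i] |= 0x80
        return bytes(out)

    words = "hello i am a sentence and hello i would be another sentence".split()
    ranks = rank_by_frequency(words)
    encoded = b"".join(variable_length_code(ranks[w]) for w in words)
    print(len(encoded), "bytes for", len(words), "words")   # here every word fits in one byte

Note that a decoder of this sketch only needs the word list in the same frequency order to rebuild the mapping, which is exactly why sending the words in frequency order is enough.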

Finally, as someone pointed out in the comments, unless you use a huge universal and comprehensive dictionary, you will have to send the dictionary together with the compressed text to allow decompression. Here again, the algorithm improves the situation: instead of sending the dictionary (thousands of numbers, followed by thousands of words), it just sends the list of words in frequency order (this is sufficient to reconstruct the codes).

While the algorithm's details are quite elaborate, in the end it manages to store large bodies of text in around 30% of their original size.

Character based dictionary compression

The LZxx algorithms don't use words, but analyse sequences of consecutive characters to find redundancy, using a sliding window. In all honesty, it's complex, as complex as some basic binary data compression schemes. Fortunately, it has been demonstrated that these algorithms have all the properties of dictionary-based algorithms.

Your own idea is also interesting: not to use a sliding window, but to create blocks of characters of uniform size, build a dictionary of these blocks, and then choose the optimal block size. For example:

 He#ll#o #I #am# a# s#en#te#nc#e.# A#nd# h#el#lo# I# w#ou#ld# b#e #an#ot#he#r #se#nt#en#ce#. #
 Hel#lo #I a#m a# se#nte#nce#. A#nd #hel#lo #I w#oul#d b#e a#not#her# se#nte#nce#. #
 #Hell#o I #am a# sen#tenc#e. A#nd h#ello# I w#ould# be #anot#her #sent#ence#. #

So here we get 31 blocks of 2, 21 blocks of 3 and 16 blocks of 4. The redundancy is negligible in this short example. If you code these with a block dictionary, then the larger the block, the fewer blocks you need to store (but the larger the dictionary entries you have to transfer along with the file). In real life, with large corpora of text, chances are high that you'll find many possible combinations of block occurrences, and the larger the block size, the higher the number of combinations and the size of the indexes.

It's a guess, but in the end you'll only gain about the bits you save by combining letters (e.g. if you store blocks of 3 lowercase letters, there are 26*26*26 = 17,576 possible combinations, so 2 bytes are sufficient to encode each of them; the result is roughly two thirds (about 67%) of the original size, far less effective than your word-based compression!).
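
A back-of-the-envelope check of that guess (toy arithmetic under my own assumptions: a 26-letter alphabet, one byte per original character, whole-byte codes per block, dictionary ignored):

    import math

    # If essentially every combination of `block_size` letters from a 26-letter
    # alphabet occurs in a large corpus, each block needs a whole-byte code of
    # ceil(block_size * log2(26) / 8) bytes; the ratio is code bytes / block bytes.
    def block_code(block_size, alphabet=26):
        code_bytes = math.ceil(block_size * math.log2(alphabet) / 8)
        return code_bytes, code_bytes / block_size

    for b in (2, 3, 4, 5):
        code_bytes, ratio = block_code(b)
        print(f"blocks of {b}: {code_bytes}-byte codes, ~{ratio:.0%} of the original size")

So even in the best case this fixed-block scheme hovers around two thirds of the original size, nowhere near the ~30% of the word-based scheme.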

So while your algorithm seems very elaborate, in the end, with large bodies of text, it will be extremely slow (naively comparing every part of the text with every other part is at least quadratic), yet with only average compression efficiency. The same goes for any other algorithm that has to compare any part of the text with any other. Using ZIP will be more effective!

But what's the purpose?

You've excluded binary data algorithms, which are the most elaborate ones; I assume there is a reason. So instead of fine-tuning algorithms and reinventing the wheel of compression (see the many references in another answer to your question), the more important question is: what is the purpose of your compression, and what are the expected properties of the compression scheme?

The word compression algorithm is, for example, very effective for full-text search on whole words: just look up (case-insensitively?) whether the word is in the dictionary, and it will be much faster to find the occurrences. It's even more effective for a free combination of words.

On the other hand, it becomes less attractive if you need partial matches across several words (e.g. if "ce." should match "sentence" followed by "."). And your explicit storing of separators will make it perform very poorly if there is a lot of extra whitespace. Here LZW would be much more efficient.

The block approaches are less effective for full-text search, as the searched string may be encoded in many different ways depending on where the block boundaries fall. On the other hand, eliminating consecutive redundant characters or blocks can improve the compression.


    If you want to learn how compression works, there are several books to choose from, including a free one: http://mattmahoney.net/dc/dce.html

    If you are looking for natural-language text compression, check http://mattmahoney.net/dc/text.html, which lists lots of compression programs with their compression strength and compression/decompression speed. The benchmark was performed on the first 1 GB of English Wikipedia text, so it represents English text very well.

    In particular, take a look at LIBBSC; imho, it provides the best speed/ratio compromise for text data.

    If you are looking specifically for dictionary-based compression, check:

    1. GLZA
    2. https://encode.ru/threads/2313-Brotli
    3. https://facebook.github.io/zstd/#small-data
    4. https://github.com/antirez/smaz
    5. https://blog.cloudflare.com/improving-compression-with-preset-deflate-dictionary/

    Overall, the encode.ru forum has attracted a lot of compression professionals, so you may be more successful asking there. I believe that your question has nothing in common with Software Engineering.

    • encode.ru requires login, which is ridiculous for such a simple topic. sources: zstd + brotli + GLZA from 2016
      – milahu
      Commented Mar 10, 2022 at 21:32

    You can find the best algorithm by trying each and every one of them on your text(s), of course.

    The thing is, would it be feasible? You can justify the time you spend optimizing this if you are going to use it millions of times with large texts. Otherwise you should pick one algorithm depending on your assessment and stick with it.

    The role of a software engineer in this situation is making this assessment, isn't it?
