I am looking for text compression algorithms (natural language compression, rather than compression of arbitrary binary data).
I have seen, for example, "An Efficient Compression Code for Text Databases". That algorithm basically uses words as symbols, creates a dictionary from them, and replaces them with integers. So, given something like this:
Hello I am a sentence. And hello I would be another sentence.
Breaking it into a dictionary would produce:
0: [space]
1: Hello
2: I
3: am
4: a
5: sentence
6: .
7: And
8: hello
9: would
a: be
b: another
Then the text is turned into:
1 0 2 0 3 0 4 0 5 6 0 7 0 8 0 2 0 9 0 a 0 b 0 5 6
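In code, the word-as-symbol idea looks roughly like this (a minimal sketch, not the actual scheme from the paper; the function and variable names are mine, and the exact numbering comes out differently from the hand-worked dictionary above):

```python
# Minimal sketch of the word-as-symbol idea (not the scheme from the paper):
# split the text into words and separators, give each distinct token an
# integer code, and emit the codes.
import re

def encode_words(text):
    tokens = re.findall(r"\w+|\W", text)   # words plus single non-word characters
    dictionary = {}                        # token -> integer code
    encoded = []
    for tok in tokens:
        if tok not in dictionary:
            dictionary[tok] = len(dictionary)
        encoded.append(dictionary[tok])
    return dictionary, encoded

dictionary, encoded = encode_words(
    "Hello I am a sentence. And hello I would be another sentence.")
print(dictionary)  # {'Hello': 0, ' ': 1, 'I': 2, 'am': 3, 'a': 4, 'sentence': 5, ...}
print(encoded)     # [0, 1, 2, 1, 3, 1, 4, 1, 5, 6, ...]
```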
Another compression approach, which it sounds like works better for numbers, is LZ77 or LZ78. While I don't fully understand how to implement them, I get the gist of it: they compare a small window of characters to gather repeated sequences of individual numbers/characters/bytes, and put those in a dictionary.
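As far as I understand the LZ78 flavor, the parse goes something like the sketch below (a rough illustration, not an optimized implementation; LZ77 instead uses a sliding window and offset/length references):

```python
# Rough LZ78-style parse: each output token is (index of the longest phrase
# already in the dictionary, next character), and that phrase + character
# becomes a new dictionary entry.
def lz78_parse(text):
    dictionary = {"": 0}          # phrase -> index; index 0 is the empty phrase
    tokens = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch          # keep extending while the phrase is known
        else:
            tokens.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                    # flush a trailing phrase with no following char
        tokens.append((dictionary[phrase], ""))
    return tokens

print(lz78_parse("Hello I am a sentence. And hello I would be another sentence."))
```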
So for our text example above, we can find a few sequences that repeat a few times:
"en" (in sentence)
"ello I" (in [Hh]ello I)
Not a particularly good example for demonstrating sequences, but we could create a small dictionary from it:
#: ello I
@: en
(choosing characters that fall outside the ASCII letter range, for example).
Then we can shorten it a bit:
Hello I am a sentence. And hello I would be another sentence.
H# am a s@t@ce. And h# would be another s@t@ce.
(We've probably actually added bytes to the data in this case, because of the dictionary and the lack of good sequences, which is why this type of encoding isn't necessarily the best fit for natural language text.)
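Concretely, that substitution is just something like the following (the # and @ codes are the ones chosen by hand above; a real scheme would need codes guaranteed not to occur in the text):

```python
# The hand-built substitution above: replace each repeated sequence with its
# single-character code. Note that the codes dictionary itself would also have
# to be stored somewhere, which is where the extra bytes come from.
text = "Hello I am a sentence. And hello I would be another sentence."
codes = {"ello I": "#", "en": "@"}   # sequence -> code, chosen by hand above

shortened = text
for seq, code in codes.items():
    shortened = shortened.replace(seq, code)

print(shortened)   # H# am a s@t@ce. And h# would be another s@t@ce.
```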
Compare that to what we had before with the word/number technique (not taking the dictionary into account):
Hello I am a sentence. And hello I would be another sentence.
H# am a s@t@ce. And h# would be another s@t@ce.
1020304056070802090a0b056
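Counting characters makes the comparison a little more concrete (ignoring the dictionaries in both cases, as above):

```python
# Character counts for the three forms above (dictionaries not included).
original  = "Hello I am a sentence. And hello I would be another sentence."
sequence  = "H# am a s@t@ce. And h# would be another s@t@ce."
word_code = "1020304056070802090a0b056"

print(len(original))   # 61
print(len(sequence))   # 47
print(len(word_code))  # 25
```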
If we had a larger volume of text, there would be more copies of the same words, so it would shorten more.
What I'm wondering is whether there is a way to automatically find the best encoding for the bytes: automatically find all the sequences that can be put into a dictionary. I don't see why that wouldn't be possible, but I imagine there's some reason it isn't, otherwise it would've been done already. It seems like the kind of problem that would be best solved in the area of DNA sequence analysis.
In DNA sequence analysis, I would guess there are techniques for finding "longest sequence matches" and the like. Basically, it seems like it would go something like this, given a string ATGCCATTCGCCGTTA:
- Compare every 2-letter pair with each other: AT, TG, GC, etc.
- Compare every 3-letter sequence: ATG, TGC, GCC, etc.
- Compare every 4-letter sequence, 5-letter sequence, and so on.
- Stop comparing when you have found there is only 1 copy of that sequence in the entire text.
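A naive version of that enumeration is easy to write down (this is just the brute-force form of the steps above, to illustrate; I stop at the first length where nothing repeats, which is in the same spirit as the stopping rule above):

```python
# Brute-force enumeration of repeated substrings: count every substring of
# length 2, then 3, then 4, ..., keep the ones that occur more than once, and
# stop at the first length where nothing repeats. This looks at O(n^2)
# substrings, so it is only meant to illustrate the idea.
from collections import Counter

def repeated_substrings(text):
    repeats = {}
    length = 2
    while length <= len(text):
        counts = Counter(text[i:i + length] for i in range(len(text) - length + 1))
        found = {seq: n for seq, n in counts.items() if n > 1}
        if not found:          # no sequence of this length occurs twice: stop
            break
        repeats.update(found)
        length += 1
    return repeats

print(repeated_substrings("ATGCCATTCGCCGTTA"))
# {'AT': 2, 'GC': 2, 'CC': 2, 'TT': 2, 'CG': 2, 'GCC': 2}
```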
But it's hard to imagine the properties of this approach. I'm not sure whether it would be extremely slow, or how much computation it would require.
But using this approach on the "hello" string above, it would eventually discover the same patterns of "en" and "ello I". At least it seems like it would.
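And indeed, reusing the repeated_substrings sketch from above on that sentence does turn up both patterns, along with longer ones such as " sentence.":

```python
repeats = repeated_substrings(
    "Hello I am a sentence. And hello I would be another sentence.")
print("en" in repeats)         # True: "en" occurs 4 times (twice in each "sentence")
print("ello I" in repeats)     # True: once in "Hello I", once in "hello I"
print(max(repeats, key=len))   # " sentence." is the longest repeat here
```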
In this sort of way, it seems like it should be possible to automatically find all the sequences in a text and derive the best encoding for it. But I'm not sure whether that's true, and I'm not sure why I haven't seen anything like this in the wild. I'm wondering if someone could explain the reason for this, and/or explain why it's not possible to automatically find all the sequences in a text to use for encoding.