finding optimal token definitions for compression

Question

I have a collection of strings which have a lot of common substrings, and I'm trying to find a good way to define tokens to compress them.

For instance, if my strings are:

    s1 = "String"
    s2 = "Bool"
    s3 = "String -> Bool"
    s4 = "String -> String"
    s5 = "(String -> String) -> String -> [Bool]"

then I might want to use the tokens:

    $1 = "String"
    $2 = "Bool"
    $3 = "$1 -> $1"

so that the strings may be defined as:

    s1 = "$1"
    s2 = "$2"
    s3 = "$1 -> $2"
    s4 = "$3"
    s5 = "($3) -> $1 -> [$2]"

(In fact, it's now clear that the definition $4 = " -> " might be a good one to add.)

I'm looking for a good (perhaps the best?) way to choose the token definitions. I'm interested in minimizing the total length of the token definitions + the resulting string definitions.
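To make the objective concrete, here is a minimal Python sketch that expands `$n` references and measures the quantity being minimized. The names `expand` and `total_cost` are hypothetical, and the sketch assumes that a literal `$` never appears in the data and that a token reference is never immediately followed by a literal digit. It also ignores any per-definition overhead such as separators.

```python
import re

# A token reference is "$" followed by one or more digits (assumed syntax).
TOKEN = re.compile(r"\$(\d+)")

def expand(text, tokens):
    """Recursively replace every $n reference with its definition."""
    return TOKEN.sub(lambda m: expand(tokens[int(m.group(1))], tokens), text)

def total_cost(tokens, strings):
    """Length of all token definitions plus length of all encoded strings."""
    return sum(len(d) for d in tokens.values()) + sum(len(s) for s in strings)

tokens = {1: "String", 2: "Bool", 3: "$1 -> $1"}
encoded = ["$1", "$2", "$1 -> $2", "$3", "($3) -> $1 -> [$2]"]
originals = [
    "String",
    "Bool",
    "String -> Bool",
    "String -> String",
    "(String -> String) -> String -> [Bool]",
]

# Any single string can be expanded without touching the others.
decoded = [expand(s, tokens) for s in encoded]
print(decoded == originals)
print(sum(len(s) for s in originals), total_cost(tokens, encoded))
```

Because `expand` only needs the token table and the one string being decoded, each string can be decompressed independently of the rest.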

Any ideas?

Update

It's kinda related to this SO question: Huffman encoding with variable length symbols

Well - I want to use this particular compression algorithm partly because the decompression routine is very easy to implement. Also, it allows me to decompress any one of the strings without having to expand any of the others. Most compression algorithms require you to decompress bytes 1 - n in order to determine what byte n+1 is. - ErikR, commented May 27, 2016 at 7:09

Peter Taylor · Accepted Answer · 2016-08-02 08:16:22Z

The generic term for what you're trying to do is grammar-based compression. In particular, you seem to be trying to solve the smallest grammar problem.

See e.g. Charikar, Lehman, Lehman, Liu, Panigrahy, Prabhakaran, Sahai, Shelat (2005). The Smallest Grammar Problem. IEEE Transactions on Information Theory 51 (7): 2554–2576
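As an illustration of what a grammar-based compressor can look like, the following is a rough Python sketch of a greedy pair-replacement heuristic in the spirit of Re-Pair. It is not taken from the cited paper and it does not find the smallest grammar; it only shows the general idea of repeatedly replacing the most frequent adjacent pair of symbols with a new rule. The function names are hypothetical, terminals are single characters, and nonterminals are represented as integers.

```python
from collections import Counter

def replace_pair(seq, pair, symbol):
    """Replace non-overlapping occurrences of pair in seq, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def build_grammar(strings, min_count=2):
    """Greedily introduce a rule for the most frequent adjacent pair."""
    seqs = [list(s) for s in strings]
    rules = {}  # nonterminal id -> (left symbol, right symbol)
    next_id = 0
    while True:
        # Simplification: overlapping pairs (as in "aaa") are counted twice.
        counts = Counter(p for seq in seqs for p in zip(seq, seq[1:]))
        if not counts:
            break
        pair, n = counts.most_common(1)[0]
        if n < min_count:
            break
        rules[next_id] = pair
        seqs = [replace_pair(seq, pair, next_id) for seq in seqs]
        next_id += 1
    return rules, seqs

def expand_symbol(sym, rules):
    """Expand one symbol back to text; characters expand to themselves."""
    if isinstance(sym, str):
        return sym
    left, right = rules[sym]
    return expand_symbol(left, rules) + expand_symbol(right, rules)

strings = [
    "String",
    "Bool",
    "String -> Bool",
    "String -> String",
    "(String -> String) -> String -> [Bool]",
]
rules, seqs = build_grammar(strings)
restored = ["".join(expand_symbol(x, rules) for x in seq) for seq in seqs]
print(restored == strings, len(rules))
```

Each string is still stored as its own sequence, so it can be expanded on its own using only the shared rule table.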

comingstorm · Accepted Answer · 2016-06-03 00:50:17Z

Lempel-Ziv style compression does the kind of thing you're looking for.

For your application, it sounds like you want a fixed "string dictionary" (LZ's internal representation of your token definitions), rather than the dynamic one used by standard streaming compression programs like gzip or compress. The general Lempel-Ziv scheme is readily adaptable to this kind of thing.
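A fixed dictionary can be applied with a simple greedy longest-match pass, sketched below in Python. This is only an assumed illustration of static-dictionary substitution, not an implementation of any particular Lempel-Ziv variant; the function name `encode` and the `$n` output syntax (borrowed from the question) are hypothetical choices, and the dictionary is given by hand rather than learned.

```python
def encode(text, dictionary):
    """Greedy longest-match substitution against a fixed phrase dictionary.

    dictionary maps phrase -> token number; unmatched characters are
    copied through unchanged.
    """
    # Try longer phrases first; skip empty phrases to guarantee progress.
    phrases = sorted((p for p in dictionary if p), key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for p in phrases:
            if text.startswith(p, i):
                out.append(f"${dictionary[p]}")
                i += len(p)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

dictionary = {"String": 1, "Bool": 2, " -> ": 4}
print(encode("String -> Bool", dictionary))
print(encode("(String -> String) -> String -> [Bool]", dictionary))
```

Greedy matching is not guaranteed to give the shortest encoding for a given dictionary, but it keeps both the encoder and the decoder very small.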

If nothing else, be aware that Lempel-Ziv-style techniques can naturally help compress the token definition data itself.
