I have a collection of strings which have a lot of common substrings, and I'm trying to find a good way to define tokens to compress them.
For instance, if my strings are:
s1 = "String" s2 = "Bool" s3 = "String -> Bool" s4 = "String -> String" s5 = "(String -> String) -> String -> [Bool]"
then I might want to use the tokens:
$1 = "String" $2 = "Bool" $3 = "$1 -> $1"
so that the strings may be defined as:
s1 = "$1" s2 = "$2" s3 = "$1 -> $2" s4 = "$3" s5 = "($3) -> $1 -> [$2]"
(In fact, now it's clear that the definition $4 = " -> "
might be good one to add.)
I'm looking for a good (perhaps the best?) way to choose the token definitions. I'm interested in minimizing the total length of the token definitions + the resulting string definitions.
Any ideas?
Update
It's kinda related to this SO question: Huffman encoding with variable length symbols