What is the difference between the average length of a code and the average length of codewords?
By definition, an alphabet is a set of things {A1, …, AN} that we might wish to encode.
A code C is a mapping from an alphabet {A1, …, AN} to a set of finite-length binary strings. C(Aj) is called the codeword for symbol Aj.
The length λj of a codeword C(Aj) is the number of bits in that codeword.
Assume a code C over an alphabet of N symbols with probabilities p(Ai), and let λi be the length of the codeword C(Ai). Then the average length of the code C is L(C) = Σi p(Ai)·λi, the sum taken over all i. Since the probabilities are nonnegative and sum to 1, this weighted average is bounded below by the shortest codeword length and above by the longest codeword length.
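As a quick worked example, here is a minimal Python sketch of this definition; the four-symbol code and its probabilities are invented for illustration:

```python
# Average length of a code C: sum over i of p(Ai) * λi.
# The alphabet, probabilities, and codewords below are made up for this example.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

avg_len = sum(probs[s] * len(cw) for s, cw in code.items())
print(avg_len)  # 1.75 bits per symbol

# Sanity check: the average lies between the shortest and longest codeword lengths.
lengths = [len(cw) for cw in code.values()]
assert min(lengths) <= avg_len <= max(lengths)
```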
On its face, this definition looks exactly the same as the "average length of codewords"; the difference comes from what is being averaged, as discussed below.
For a text whose characters are drawn from an alphabet of n symbols, the average code length is O(log n).
So by this definition, the average length of an optimal code generated by the Huffman algorithm is always at most ⌈log n⌉, where the ceiling is the smallest integer greater than or equal to log n (as discussed here). More often, entropy is used to lower-bound the average code length (ref).
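For concreteness, here is a minimal sketch of both bounds in Python; the distribution, the helper name huffman_lengths, and its tie-breaking are my own choices for illustration, not part of the question:

```python
import heapq
from math import ceil, log2

def huffman_lengths(probs):
    """Return Huffman codeword lengths for a list of probabilities."""
    # Heap entries: (subtree probability, unique tie-breaker, symbol indices in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:  # merging two subtrees adds one bit to every symbol below
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

probs = [0.4, 0.2, 0.2, 0.1, 0.1]            # made-up distribution over n = 5 symbols
lengths = huffman_lengths(probs)
avg = sum(p * l for p, l in zip(probs, lengths))
entropy = -sum(p * log2(p) for p in probs)
print(avg, entropy, ceil(log2(len(probs))))  # 2.2  2.12...  3
assert entropy <= avg <= ceil(log2(len(probs)))
```

The upper bound holds because the fixed-length code with every codeword of length ⌈log n⌉ is itself a valid prefix code, and the Huffman code is optimal, so its average length can be no worse.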
However, from your explanation of the solution, the context seems different: the "average" in your question is the sum of all codeword lengths divided by the number of codewords. That quantity is Ω(log n) instead (a proof FYI), meaning the best-case average length is about log n (a lower bound, not the upper bound that O(log n) would indicate). And this is consistent with your example of 1-bit, 2-bit, …, (2n−1)-bit strings.
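To see the Ω(log n) behavior numerically, here is a small sketch, assuming the "average" above is the unweighted mean over distinct codewords; min_avg_length is a hypothetical helper that greedily takes the shortest distinct binary strings, which minimizes that mean:

```python
from math import log2

def min_avg_length(n):
    """Smallest possible mean length of n distinct nonempty binary strings:
    greedily use all 2 strings of length 1, all 4 of length 2, and so on."""
    total, taken, length = 0, 0, 1
    while taken < n:
        k = min(2 ** length, n - taken)  # strings of this length still available
        total += k * length
        taken += k
        length += 1
    return total / n

for n in (4, 16, 256, 4096):
    print(n, round(min_avg_length(n), 2), round(log2(n), 2))
# As n grows, the minimum mean tracks log2(n): the unweighted average is Ω(log n).
```

The lower bound follows from a counting argument: there are fewer than 2^(ℓ+1) distinct binary strings of length at most ℓ, so n distinct codewords force a mean length of roughly log n.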