A Conversation for MPEG Audio Layer 3 (MP3) Technical Guide

Huffman compression

Post 1

The Cow

A similar system is used in Morse code.
E, the most common letter, is "."
Q is "--.-"


Huffman compression

Post 2

Ausnahmsweise, wie üblich (Consistently inconsistent)

The given example is not quite accurate. No single code can be the prefix of another, longer code.
If you wanted to compress, say, written English, you would count the frequency of each letter in the sample text and sort them into ascending order. Then you start looking for the lowest two frequencies. You're going to build a tree from the leaves to the root. Add the two lowest frequencies, create a node with the value of the sum and connect the leaves that contributed to them. Next iteration, include the new node (but not the leaves) in the search for the next two lowest. Join them with a node. Eventually everything meets at a root node. Now you walk the tree, from the root, assigning, say 0 for a left branch and 1 for a right branch. When you reach a leaf (a letter) its Hufman code is the string of 0s and 1s you chained together. If the letter E represented more then 50% of the population you would have a very unbalance tree. "E" might be a single 1. No other letter's code would begin with one because they are all down the left branch of the tree. (A picture would really help.) All of this is in the classic programmer's text by Knut.

I have used it to compress text. I used adaptive Huffman first for the letters, then I did the whole thing over again for words. That meant that I could store the most common words in a bit or two. To uncompress them I had to decode each word to its string of letter codes, and the letter codes into actual letters.


Huffman compression

Post 3

The Cow

I realise it's not a perfect match (after all, Morse code predates Huffman by quite a while) but the concept is similar.


Huffman compression

Post 4

Ausnahmsweise, wie üblich (Consistently inconsistent)

Sorry - I was referring to the example in the text of the article. I think it had 10 and 101 as two distinct codes. That cannot happen. The only way you can resolve ambiguity when decoding is that no shorter code shares its beginning sequence with a longer one.


Huffman compression

Post 5

The Cow

Ah, right.
Yes, of course.

Got any idea what the shortest bit pattern usually is? Because having a 5-bit minimum gives you far fewer possibilities than a 6-bit minimum.


Removed

Post 6

Ausnahmsweise, wie üblich (Consistently inconsistent)

This post has been removed.


Huffman compression

Post 7

Ausnahmsweise, wie üblich (Consistently inconsistent)

In a very unbalanced population (say, over 50% of the characters are "E") you can have a single-bit code.
What the shortest usually is depends on the spread of the data. I only know about "adaptive" Huffman, which adapts to the current population. I assume there's another form that tries to be more general?


Huffman compression

Post 8

Robbish

Thinking about Huffman encoding: it is used to compress PC files (steps back in amazement). However, the data is not always taken byte by byte. It is sometimes found that if you group the bits into symbols of, say, 10 bits, the encoding can work much more effectively.
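That grouping step might look like this (a sketch; the 10-bit width and the choice to drop a trailing partial group are my own simplifications - a real coder would pad the input or store the remainder):

```python
def bit_groups(data, width=10):
    """Turn a byte string into fixed-width bit-string symbols suitable as
    input to a Huffman coder; any trailing partial group is dropped here."""
    bits = "".join(f"{b:08b}" for b in data)
    return [bits[i:i + width] for i in range(0, len(bits) - width + 1, width)]
```

The resulting 10-bit symbols are then counted and coded exactly as letters would be; with up to 1024 distinct symbols the frequency model can capture patterns that straddle byte boundaries.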

Also, with text, don't forget that some letters can be totally ignored, thus saving characters. With this little clump of text, you would not have to find an encryption for the letter 'Z'.

Except that I've just included it. But you know what I mean!!!


Huffman compression

Post 9

Ausnahmsweise, wie üblich (Consistently inconsistent)

If you did not find a Z in the first pass when analyzing the population, its frequency would be zero and it would not be assigned a bit pattern at all.


I've used it at two levels. I had codes for all of the letters in the population, but then I started looking at words too (which would be strings of compressed characters). In the final compressed data a common word might be just one or two bits. Of course, I had to have more look-up tables to decompress: first I'd decode the word, then I'd have to decode all of its characters. I squeezed more out of it by considering the leading space on a word as part of the word. (I think I used a one-bit flag to say whether a word had a leading space or not - that way "THE" and " THE" would share the same Huffman code.)

All of this was used for messages in fire alarm panels, back when ROM was more expensive.

Awu
P.S. I was new to North America and was going through a dump of a crash. I started piecing together a compressed message.

H A A G E N D A Z - I thought it was garbage and that I had a bug. It turns out the panel was in a shopping mall and one of the locations was a Haagen Daz ice cream shop. Phewww!

