| • Science | • People | • Locations | • Timeline |
| Contents | ||
Huffman coding uses a specific method for choosing the representations for each symbol, resulting in a prefix-free code (that is, no bit string of any symbol is a prefix of the bit string of any other symbol) that expresses the most common characters in the shortest way possible. It has been proven that Huffman coding is the most effective compression method of this type: no other mapping of source symbols to strings of bits will produce a smaller output when the actual symbol frequencies agree with those used to create the code. (Huffman coding is such a widespread method for creating prefix-free codes that the term "Huffman code" is widely used as a synonym for "prefix-free code" even when such a code was not produced by Huffman's algorithm.)
For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding.
In 1951, David Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree, and quickly proved this method the most efficient.
In doing so, the student outdid his professor, who had worked with information theory inventor Claude Shannon to develop a similar code. Huffman avoided the major flaw of Shannon-Fano coding by building the tree from the bottom up instead of from the top down.
Given. A set of symbols and their costs..
Find. A prefix binary character code (a sets of codewords) with minimum weighted path length..
Note-1. A code wherein each character is represented by a unique binary string (codeword) is called a binary character code..
Note-2. A prefix code is a code having the property that no codeword is a prefix of any other codeword.
Note-3. In the standard Huffman coding problem, it is assumed that each symbol in the set that the code words are constructed from has an equal cost to transmit: a code word whose length is N digits will always have a cost of N, no matter how many of those digits are 0s, how many are 1s, etc. When working under this assumption, minimizing the total cost of the message and minimizing the total number of digits are the same thing.
Input.
Alphabet A = {a[1], a[2], ..., a[n]} which is the symbol alphabet of size n.
Set C = {c[1], c[2], ..., c[n]} which is the set of the symbol costs, i.e., c[i] = cost (a[i]), 1 <= i <= n.
Output.
Code H(A,C) = {h[1], h[2], ..., h[n]} which is the set of (binary) codewords, where h[i] is codeword of a[i], 1 <= i <= n.
Goal.
Let S(H) = sum (c[i] * length (h[i])) (1 <= i <= n) be weighted path length of code H. Must be: S(H) <= S(T) for any code T(A,C).
Input.
Symbol alphabet A = {a, b, c, d, e, f, g, h, i },
Set of costs C = {10, 15, 5, 15, 20, 5, 15, 30, 5 }.
Output.
Code H(A,C) = {001, 010, 00000, 011, 101, 00001, 100, 11, 0001}.
Sum S(H) = 10*3 + 15*3 + 5*5 + 15*3 + 20*3 + 5*4 + 15*3 + 30*2 + 5*4 = 355.