06 Apr 2016
On this page:
Huffman encoding is a compression technique developed by David Huffman and published in his 1952 paper 'A Method for the Construction of Minimum-Redundancy Codes'. The technique centres on encoding characters as a list of code words, which are then used to generate a coded version of the original data. Huffman encoding can be very efficient, and should generally yield compression as low as 20% of the original data size.
How it Works
The input data is scanned, and a list of characters and their corresponding weight (the number of instances the character appears in the data). The weights of each character are used to compute the length of the code words that represent that character: larger weights should yield smaller code words, a key aspect of compressing the data. Characters are then arranged into a binary tree, where the resulting layout is used to determine the encoding table. The last step is using the encoding table to create the new file.
Constructing the Binary Tree
Consider the following text:
abccccddddddefghijjjjj
Represented as a list of characters with corresponding weights:
Index Char Weight 0 a 1 1 b 1 2 c 4 3 d 6 4 e 1 5 f 1 6 g 1 7 h 1 8 i 1 9 j 5
This list is now turned into a binary tree. A binary tree is a tree where each node may have a maximum of two child nodes. The left child node is consider the 0 child node, whereas the right child node is considered the 1 child node. The first two nodes are removed from the list, and a new parent node is constructed. This parent node's weight is the combined weight of both children. In our example, after the first iteration our list would look like this:
Index Char Weight 00 null null // used to be a 01 null null // used to be b 02 e 1 03 f 1 04 g 1 05 h 1 06 i 1 07 c 4 08 j 5 09 d 6 10 a + b 2 // new parent node
The next two lowest weight nodes are added to a parent:
Index Char Weight 00 null null // used to be a 01 null null // used to be b 02 null null // used to be e 03 null null // used to be f 04 g 1 05 h 1 06 i 1 07 c 4 08 j 5 09 d 6 10 a + b 2 // new parent node 11 e + f 2 // new parent node
And so on. At the end of this round of processing, the list looks like this:
Index Char Weight 00 null null // used to be a 01 null null // used to be b 02 null null // used to be e 03 null null // used to be f 04 null null // used to be g 05 null null // used to be h 06 null null // used to be i 07 null null // used to be c 08 null null // used to be j 09 null null // used to be d 10 a + b 2 // new parent node 11 e + f 2 // new parent node 12 g + h 2 // new parent node 13 i + c 5 // new parent node 14 j + d 11 // new parent node
Null nodes are removed from the list:
Index Char Weight 0 a + b 2 1 e + f 2 2 g + h 2 3 i + c 5 4 j + d 11
And the next round of processing begins, always taking the smallest weighted nodes and combining them under a new parent. At the end of the next round of processing, the list looks like this:
Index Char Weight 0 null null // a + b 1 null null // e + f 2 null null // g + h 3 null null // i + c 4 j + d 11 5 a + b (2), e + f (2) 4 // new parent node 6 g + h (2), i + c (5) 7 // new parent node
Null nodes are removed from the list:
Index Char Weight 0 a + b (2), e + f (2) 4 1 g + h (2), i + c (5) 7 2 j + d 11
At the end of the next round of processing:
Index Char Weight 0 null 4 // a + b (2), e + f (2) 1 null 7 // g + h (2), i + c (5) 2 j + d 11 // new parent node 3 (a + b, e + f) - (4), (g + h, i + c) - (7) 11 // new parent node
After removing null nodes from the list, we are ready for the final round of processing:
Index Char Weight 0 j + d 11 // new parent node 1 (a + b, e + f) - (4), (g + h, i + c) - (7) 11 // new parent node
And finally, after all processing, there is only one node left:
Index Char Weight
0 (j + d) - 11, (a + b, e + f, g + h, i + c) - (11) 22 // new parent node
This final node forms the root of the tree, and is used to generate a list of binary words that represent each character. Starting at the root of the tree, following a left child node inserts a 0 into the binary word, following a right child node inserts a 1 into the binary word. Using this method, the following list of binary words is created:
Char Weight BinaryWord Int32Word
j 5 00 0 // This column is for sorting purposes only
d 6 01 1
a 1 1000 8
b 1 1001 9
e 1 1010 10
f 1 1011 11
g 1 1100 12
h 1 1101 13
I 1 1110 14
c 4 1111 15
Note that nodes without a character value are not present in this list, but are able to be inferred from the list sorting operations. For the moment, I'll leave you to construct the tree using the information above.
Encoding the Data
The next step is to encode the original data using the binary words created from the previous steps. Moving forward through the data one byte at a time, each character is used to perform a lookup for its corresponding binary word. This binary word is added into a binary word buffer. When the binary word buffer reaches sufficient length, 8 values are read off and removed to construct a new byte. Consider the following:
Index Char BinaryWord ByteBuffer 00 a 1000 // Only 4 bits here, not enough for a new byte 01 b 1001 // Combined with the previous entry, there are enough bits here for a new byte: 10001001 => 137, remove 8 entries from the buffer 02 c 1111 // Start a new byte using these 4 bits 03 c 1111 // 11111111 => 255 04 c 1111 // Start a new byte 05 c 1111 // 11111111 => 255 06 d 01 // Start a new byte 07 d 01 // 0101 - the buffer is only half full 08 d 01 // 010101 09 d 01 // 01010101 => 85 - note that this one byte actually represents 4 chars! 10 d 01 // Start a new byte 11 d 01 // 0101 12 e 1010 // 01011010 => 90 - this one byte represents 3 chars 13 f 1011 // Start a new byte 14 g 1100 // 10111100 => 188 15 h 1101 // New byte 16 i 1110 // 11011110 => 222 17 j 00 // New byte 18 j 00 // 0000 19 j 00 // 000000 20 j 00 // 00000000 => 0 21 j 00 // Start a new byte
With the processing of the final character, the byte buffer only contains 2 values, which is not enough to construct a byte, however as the binary word is 00 it is simply converted to 0. Therefore, the final byte to be constructed is 0. Processing of these 22 characters has yielded a list containing only 9 bytes, about 41% of the original data size.
Unpacking the Data
Decompression (in this example at least) is relatively straight forward. The byte list is read, and each byte converted into a binary word, which is added into a buffer. If the buffer is less than 8 values, then it is 'fixed' (more on that later). Reading each bit from the binary word indicates which path to take through the binary tree (constructed above): starting at the root node, follow the left child for a 0 value, and follow the right for a 1 value. The entire decompression process for the above data:
Index Byte Characters
0 137 => 10001001 a, b
1 255 => 11111111 c, c
2 255 => 11111111 c, c
3 85 => 01010101 d, d, d, d
4 90 => 01011010 d, d, e
5 188 => 10111100 f, g
6 222 => 11011110 h, i
7 0 => 00000000 j, j, j, j
8 0 => 00 j // This is the last value, so rather than pad it out, we can just go directly to the char value
The output data now contains 22 characters, the data has been successfully unpacked.
Final Thoughts
Conveniently, this example ignores completely the size of the encoding table, as well as how storage and retrieval mechanisms, so theoretically 9 bytes is probably as small as we can get. If this compressed data were to be used outside this example, then the binary tree and code words would also need to be stored along with the byte list in order to unpack the data at some other time.
The Code
I've included my somewhat rambling encoding and decoding process, in C# below.
public class HuffmanCompressor { private HuffmanNode oRootHuffmanNode; private List<HuffmanNode> oValueHuffmanNodes; private List<HuffmanNode> BuildBinaryTree(string Value) { List<HuffmanNode> oHuffmanNodes = GetInitialNodeList(); // Update node weights Value.ToList().ForEach(m => oHuffmanNodes[m].Weight++); // Select only nodes that have a weight oHuffmanNodes = oHuffmanNodes .Where(m => (m.Weight > 0)) .OrderBy(m => (m.Weight)) .ThenBy(m => (m.Value)) .ToList(); // Assign parent nodes oHuffmanNodes = UpdateNodeParents(oHuffmanNodes); oRootHuffmanNode = oHuffmanNodes[0]; oHuffmanNodes.Clear(); // Sort nodes into a tree SortNodes(oRootHuffmanNode, oHuffmanNodes); return oHuffmanNodes; } public void Compress(string FileName) { FileInfo oFileInfo = new FileInfo(FileName); if (oFileInfo.Exists == true) { string sFileContents = ""; using (StreamReader oStreamReader = new StreamReader(File.OpenRead(FileName))) { sFileContents = oStreamReader.ReadToEnd(); } List<HuffmanNode> oHuffmanNodes = BuildBinaryTree(sFileContents); // Filter to nodes we care about oValueHuffmanNodes = oHuffmanNodes .Where(m => (m.Value.HasValue == true)) .OrderBy(m => (m.BinaryWord)) // Not really required, for presentation purposes .ToList(); // Construct char to binary word dictionary for quick value to binary word resolution Dictionary<char, string> oCharToBinaryWordDictionary = new Dictionary<char, string>(); foreach (HuffmanNode oHuffmanNode in oValueHuffmanNodes) { oCharToBinaryWordDictionary.Add(oHuffmanNode.Value.Value, oHuffmanNode.BinaryWord); } StringBuilder oStringBuilder = new StringBuilder(); List<byte> oByteList = new List<byte>(); for (int i = 0; i < sFileContents.Length; i++) { string sWord = ""; // Append the binary word value using the char located at the current file position oStringBuilder.Append(oCharToBinaryWordDictionary[sFileContents[i]]); // Once we have at least 8 chars, we can construct a byte while (oStringBuilder.Length >= 8) { sWord = oStringBuilder.ToString().Substring(0, 8); // Remove the word to be constructed from the buffer oStringBuilder.Remove(0, sWord.Length); } // Convert the word and add it onto the list if (String.IsNullOrEmpty(sWord) == false) { oByteList.Add(Convert.ToByte(sWord, 2)); } } // If there is anything in the buffer, make sure it is retrieved if (oStringBuilder.Length > 0) { string sWord = oStringBuilder.ToString(); // Convert the word and add it onto the list if (String.IsNullOrEmpty(sWord) == false) { oByteList.Add(Convert.ToByte(sWord, 2)); } } // Write compressed file string sCompressedFileName = Path.Combine(oFileInfo.Directory.FullName, String.Format("{0}.compressed", oFileInfo.Name.Replace(oFileInfo.Extension, ""))); if (File.Exists(sCompressedFileName) == true) { File.Delete(sCompressedFileName); } using (FileStream oFileStream = File.OpenWrite(sCompressedFileName)) { oFileStream.Write(oByteList.ToArray(), 0, oByteList.Count); } } } public void Decompress(string FileName) { FileInfo oFileInfo = new FileInfo(FileName); if (oFileInfo.Exists == true) { string sCompressedFileName = String.Format("{0}.compressed", oFileInfo.FullName.Replace(oFileInfo.Extension, "")); byte[] oBuffer = null; using (FileStream oFileStream = File.OpenRead(sCompressedFileName)) { oBuffer = new byte[oFileStream.Length]; oFileStream.Read(oBuffer, 0, oBuffer.Length); } // Find the zero node HuffmanNode oZeroHuffmanNode = oRootHuffmanNode; while (oZeroHuffmanNode.Left != null) { oZeroHuffmanNode = oZeroHuffmanNode.Left; } // Unpack the file contents HuffmanNode oCurrentHuffmanNode = null; StringBuilder oStringBuilder = new StringBuilder(); for (int i = 0; i < oBuffer.Length; i++) { string sBinaryWord = ""; byte oByte = oBuffer[i]; if (oByte == 0) { sBinaryWord = oZeroHuffmanNode.BinaryWord; } else { sBinaryWord = Convert.ToString(oByte, 2); } if ((sBinaryWord.Length < 8) && (i < (oBuffer.Length - 1))) { // Pad binary word out to 8 places StringBuilder oBinaryStringBuilder = new StringBuilder(sBinaryWord); while (oBinaryStringBuilder.Length < 8) { oBinaryStringBuilder.Insert(0, "0"); } sBinaryWord = oBinaryStringBuilder.ToString(); } // Use the binary word to navigate the tree looking for the value for (int j = 0; j < sBinaryWord.Length; j++) { char cValue = sBinaryWord[j]; if (oCurrentHuffmanNode == null) { oCurrentHuffmanNode = oRootHuffmanNode; } if (cValue == '0') { oCurrentHuffmanNode = oCurrentHuffmanNode.Left; } else { oCurrentHuffmanNode = oCurrentHuffmanNode.Right; } if ((oCurrentHuffmanNode.Left == null) && (oCurrentHuffmanNode.Right == null)) { // No more child nodes to choose from, so this must be a value node oStringBuilder.Append(oCurrentHuffmanNode.Value.Value); oCurrentHuffmanNode = null; } } } // Write out file string sUncompressedFileName = Path.Combine(oFileInfo.Directory.FullName, String.Format("{0}.uncompressed", oFileInfo.Name.Replace(oFileInfo.Extension, ""))); if (File.Exists(sUncompressedFileName) == true) { File.Delete(sUncompressedFileName); } using (StreamWriter oStreamWriter = new StreamWriter(File.OpenWrite(sUncompressedFileName))) { oStreamWriter.Write(oStringBuilder.ToString()); } } } private static List<HuffmanNode> GetInitialNodeList() { List<HuffmanNode> oGetInitialNodeList = new List<HuffmanNode>(); for (int i = Char.MinValue; i < Char.MaxValue; i++) { oGetInitialNodeList.Add(new HuffmanNode((char)(i))); } return oGetInitialNodeList; } private static void SortNodes(HuffmanNode Node, List<HuffmanNode> Nodes) { if (Nodes.Contains(Node) == false) { Nodes.Add(Node); } if (Node.Left != null) { SortNodes(Node.Left, Nodes); } if (Node.Right != null) { SortNodes(Node.Right, Nodes); } } private static List<HuffmanNode> UpdateNodeParents(List<HuffmanNode> Nodes) { // Assign parent nodes while (Nodes.Count > 1) { int iOperations = (Nodes.Count / 2); for (int iOperation = 0, i = 0, j = 1; iOperation < iOperations; iOperation++, i += 2, j += 2) { if (j < Nodes.Count) { HuffmanNode oParentHuffmanNode = new HuffmanNode(Nodes[i], Nodes[j]); Nodes.Add(oParentHuffmanNode); Nodes[i] = null; Nodes[j] = null; } } // Remove null nodes Nodes = Nodes .Where(m => (m != null)) .OrderBy(m => (m.Weight)) // Choose the lowest weightings first .ToList(); } return Nodes; }
Here's the code for the HuffmanNode type itself:
public class HuffmanNode { public HuffmanNode() { } public HuffmanNode(char Value) { cValue = Value; } public HuffmanNode(HuffmanNode Left, HuffmanNode Right) { oLeft = Left; oLeft.oParent = this; oLeft.bIsLeftNode = true; oRight = Right; oRight.oParent = this; oRight.bIsRightNode = true; iWeight = (oLeft.Weight + oRight.Weight); } private string sBinaryWord; private bool bIsLeftNode; private bool bIsRightNode; private HuffmanNode oLeft; private HuffmanNode oParent; private HuffmanNode oRight; private char? cValue; private int iWeight; public string BinaryWord { get { string sReturnValue = ""; if (String.IsNullOrEmpty(sBinaryWord) == true) { StringBuilder oStringBuilder = new StringBuilder(); HuffmanNode oHuffmanNode = this; while (oHuffmanNode != null) { if (oHuffmanNode.bIsLeftNode == true) { oStringBuilder.Insert(0, "0"); } if (oHuffmanNode.bIsRightNode == true) { oStringBuilder.Insert(0, "1"); } oHuffmanNode = oHuffmanNode.oParent; } sReturnValue = oStringBuilder.ToString(); sBinaryWord = sReturnValue; } else { sReturnValue = sBinaryWord; } return sReturnValue; } } public HuffmanNode Left { get { return oLeft; } } public HuffmanNode Parent { get { return oParent; } } public HuffmanNode Right { get { return oRight; } } public char? Value { get { return cValue; } } public int Weight { get { return iWeight; } set { iWeight = value; } } public override string ToString() { StringBuilder oStringBuilder = new StringBuilder(); if (cValue.HasValue == true) { oStringBuilder.AppendFormat("'{0}' ({1}) - {2} ({3})", cValue.Value, iWeight, BinaryWord, BinaryWord.BinaryStringToInt32()); } else { if ((oLeft != null) && (oRight != null)) { if ((oLeft.Value.HasValue == true) && (oRight.Value.HasValue == true)) { oStringBuilder.AppendFormat("{0} + {1} ({2})", oLeft.Value, oRight.Value, iWeight); } else { oStringBuilder.AppendFormat("{0}, {1} - ({2})", oLeft, oRight, iWeight); } } else { oStringBuilder.Append(iWeight); } } return oStringBuilder.ToString(); } }
And finally, an extension method that is also required for the code to compile:
public static int BinaryStringToInt32(this string Value) { int iBinaryStringToInt32 = 0; for (int i = (Value.Length - 1), j = 0; i >= 0; i--, j++) { iBinaryStringToInt32 += ((Value[j] == '0' ? 0 : 1) * (int)(Math.Pow(2, i))); } return iBinaryStringToInt32; }
To get this code running, it will need to be stitched together, perhaps in a console application:
public static void Main(string[] Arguments) { HuffmanCompressor oHuffmanCompressor = new HuffmanCompressor(); oHuffmanCompressor.Compress("Huffman.txt"); Console.WriteLine(); oHuffmanCompressor.Decompress("Huffman.txt"); Console.WriteLine(); }
Note that for this example, decompression is only possible if the binary tree has been constructed, i,e., the data has already been compressed.
Copyright © 2025 carlbelle.com