Huffman encoding¶
Huffman encoding is a classical lossless data compression algorithm designed to reduce the size of data without losing any information. Its fundamental idea is to represent each symbol (such as a character in a text file) with a unique variable-length binary code, where the most frequent symbols are assigned the shortest codes and the least frequent symbols get longer codes. This efficiently reduces the total number of bits needed to encode a message, especially in data where some symbols appear much more often than others.
Class to represent nodes of the huffman tree¶
In [1]:
class Node:
def __init__(self, character_: str, frequency_: int):
self.left = None
self.right = None
self.char = character_
self.freq = frequency_
# < operator overloading
def __lt__(self, other):
return self.freq < other.freq
# > operator overloading
def __gt__(self, other):
return self.freq > other.freq
# equal operator overloading
def __eq__(self, other):
return self.freq == self.freq
Algorithm¶
1. Create Node Classes for Huffman Tree¶
- Each node has a character and its frequency.
- The node may point to a left and right child node.
- Operator overloading (lt, gt, eq) is used for maintaining node ordering in a heap (priority queue).
2. Populate a Min-Heap of Leaf Nodes¶
- For every unique character, create a node with its frequency.
- Push all nodes into a min-heap, so nodes with the lowest frequency have the highest priority.
3. Build the Huffman Tree¶
- While there are more than one nodes in the heap:
- Extract the two nodes with the lowest frequency.
- Create a new internal node whose frequency is the sum of the two extracted nodes.
- Set the two extracted nodes as the left and right children of the new node.
- Insert the new node back into the min-heap.
- After the loop, the heap contains only one node—the root of the Huffman Tree.
4. Traverse the Huffman Tree to Assign Codewords¶
- Start from the root node.
- Traverse left by appending '0' to the codeword, and right by appending '1'.
- When a leaf node is reached (character node), assign the accumulated binary codeword to that character.
- This is commonly implemented as a recursive preorder traversal.
5. Store and Return the Encodings¶
- Store all character-to-codeword mappings in a dictionary (charMap).
- Return this mapping as the result.
Function to traverse tree in preorder¶
In [2]:
from typing import List, Dict
def preOrder(root, charMap_: Dict, curr = ""):
if root is None:
return
# Leaf node represents a character.
if root.left is None and root.right is None:
charMap_[root.char] = curr
return
preOrder(root.left, charMap_, curr + '0')
preOrder(root.right, charMap_, curr + '1')
In [3]:
import heapq
def huffmanCodes(s, freq: List[int], chars: List[str]) -> Dict[str, str]:
# Code here
n = len(s)
# Min heap for node class.
nodeHeap = []
for i in range(n):
heapq.heappush(nodeHeap, Node(chars[i], freq[i]))
# Construct the huffman tree.
while len(nodeHeap) >= 2:
# Left node
l = heapq.heappop(nodeHeap)
# Right node
r = heapq.heappop(nodeHeap)
newNode = Node("", l.freq + r.freq)
newNode.left = l
newNode.right = r
heapq.heappush(nodeHeap, newNode)
root = heapq.heappop(nodeHeap)
charMap = {}
preOrder(root, charMap)
return charMap
Driver code¶
In [4]:
if __name__ == "__main__":
s = "abcdef"
freq = [50, 10, 30, 5, 3, 2]
ans = huffmanCodes(s, freq, list(s))
print(ans)
{'a': '0', 'b': '100', 'd': '1010', 'f': '10110', 'e': '10111', 'c': '11'}
Visualizing the Huffman Tree¶
graph TD
N1(100)
a(a: 50)
N2(50)
N3(20)
c(c: 30)
b(b: 10)
N4(10)
d(d: 5)
N5(5)
f(f: 2)
e(e: 5)
N1 --> a
N1 --> N2
N2 --> N3
N2 --> c
N3 --> b
N3 --> N4
N4 --> d
N4 --> N5
N5 --> f
N5 --> e
Note: The smaller value is on the left and larger value is on the right.