Overview of data compression and examples of various algorithms and implementations

Data Compression

Data compression is the process of reducing the size of data in order to represent information more efficiently. The main purpose of data compression is to make data smaller, thereby saving storage space and improving data transfer efficiency.

There are two main types of data compression techniques: lossless compression and lossy compression.

Lossless compression guarantees that the original data can be reproduced exactly: when the compressed data is decompressed, the original data is restored bit for bit. Typical lossless methods include run-length encoding, Huffman coding, and Lempel-Ziv-Welch (LZW).

Lossy compression, on the other hand, sacrifices some information when compressing data, making it impossible to fully reproduce the original data. However, in many cases, data can be compressed with little effect on human perception. Typical methods of lossy compression include JPEG (image compression), MP3 (audio compression), and H.264 (video compression).

The effectiveness of data compression depends on both the type of data and the compression method: some data compresses very well while other data barely compresses at all, and a method that works well on one kind of data may be ineffective on another.

Data compression is widely used in a variety of situations, including data backup, data transfer over networks, and Internet content distribution. The use of effective compression techniques allows data to be stored and transferred efficiently, saving resources and improving performance.

Data Compression Algorithms

There are various types of data compression algorithms. Typical algorithms are described below.

  • Run-Length Encoding (RLE)

Run-length encoding compresses data by recording, for each run of identical values, the value together with the number of consecutive occurrences. It is a very simple compression technique that is especially useful when the data contains sequential patterns or frequently repeated values.

In run-length compression, the data is scanned sequentially and consecutive blocks of identical values are counted. Each run is then expressed in the form "number of occurrences + data value".

For example, if the string “AAAABBBCCD” is run-length compressed, it becomes “4A3B2C1D,” where “A” appears four times in succession, “B” three times in succession, “C” twice in succession, and “D” only once. Run-length compression is considered a type of lossless compression because the compressed data can accurately reproduce the original data.

Run-length compression is particularly effective for data in which the same value tends to repeat consecutively, such as image data and audio data. For example, it works very well on single-color image data, and some image file formats use run-length compression internally.

However, run-length compression is not effective for data in which the same values rarely appear consecutively; in the worst case the compressed data can even become larger than the original. Because its effectiveness depends so strongly on the type of data, its use as a general-purpose compression method is limited.
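To make the idea concrete, here is a minimal Python sketch of run-length encoding and decoding. The names rle_encode and rle_decode are our own, and the decoder assumes the input contains no digit characters:

import re

def rle_encode(text):
    # Scan the input and emit "count + value" for each run of identical characters
    result = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        result.append(f"{j - i}{text[i]}")
        i = j
    return "".join(result)

def rle_decode(encoded):
    # Read each "count + character" pair and expand it back into a run
    return "".join(char * int(count)
                   for count, char in re.findall(r"(\d+)(\D)", encoded))

print(rle_encode("AAAABBBCCD"))   # -> 4A3B2C1D
print(rle_decode("4A3B2C1D"))    # -> AAAABBBCCD
print(rle_encode("ABCD"))        # -> 1A1B1C1D (larger than the input)

The last line shows the expansion problem described above; binary formats typically avoid the digit restriction by using fixed-width count fields instead of decimal text.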

  • Huffman Coding:

Huffman coding is a method of efficiently compressing data by assigning bit patterns based on the frequency of data occurrence. Huffman coding is a type of lossless compression that accurately reproduces data.

Huffman coding assigns shorter bit patterns to data that appears more frequently and longer bit patterns to data that appears less frequently, based on the frequency and probability of data appearance. Since shorter code words are assigned to more frequently appearing data, the overall number of data bits can be reduced.

The Huffman coding procedure is as follows:

    1. Calculate the frequency of data occurrence: scan the input data and count the number of occurrences of each symbol.
    2. Create a leaf node for each symbol and insert all nodes into a priority queue ordered by frequency.
    3. Repeatedly remove the two lowest-frequency nodes, join them under a new parent node whose frequency is their sum, and reinsert the parent into the queue until only the root remains. As a result, data with a higher frequency of occurrence is placed higher in the tree and data with a lower frequency is placed lower.
    4. Assign the bit "0" to each left branch of the tree and the bit "1" to each right branch. The higher a symbol is placed in the tree, the shorter the bit pattern assigned to it.
    5. For each symbol, generate the corresponding bit pattern by following the path from the root to its leaf. This allows the data to be represented in a compressed format.

Huffman coding can achieve effective compression by taking advantage of the statistical properties of the data. Since data with a high frequency of occurrence is represented by short bit patterns, it is an effective compression method when data is not random or when certain patterns are present. In particular, it is widely used for common data formats such as text data and image data.

Just as it does for compression, Huffman coding uses the tree during decompression to map bit patterns back to the original data. It is an important method for achieving efficient data compression and is used in many compression algorithms and formats (e.g., ZIP, GZIP); a working Python implementation is shown at the end of this article.

  • Lempel-Ziv-Welch (LZW) Compression:

LZW compression is an effective lossless compression algorithm used to compress text and general data. The steps of LZW compression are as follows:

    1. Create the initial dictionary: First, an initial dictionary is created. It contains every single character (or a set of fixed-length phrases).
    2. Scan the data: The input is scanned while checking whether the current phrase plus the next character exists in the dictionary. If it does, the next character is appended to the current phrase and scanning continues. If it does not, the code of the current phrase is output, the extended phrase (current phrase plus next character) is registered in the dictionary, and scanning restarts from that next character.
    3. Update the dictionary: Phrases registered during the scan are added to the dictionary with new code numbers and are used when searching for subsequent phrases.
    4. Output the compressed data: When the scan is complete, the sequence of output codes constitutes the compressed data.

LZW compression effectively compresses data by registering repeated phrases in a dictionary, and is particularly effective for text data; it is used in file formats such as GIF and TIFF and in the Unix compress utility.

As with compression, LZW uses the dictionary during decompression to map the codes back to the original data. It is sometimes used in conjunction with run-length compression or Huffman coding, and improved variants exist that achieve higher compression efficiency. A minimal sketch of the procedure in Python is shown below.
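The function names lzw_compress and lzw_decompress are our own, and a practical implementation would also pack the output codes into bits rather than keeping them as a list of integers:

def lzw_compress(data):
    # Step 1: the initial dictionary contains all single characters (byte values)
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    phrase = ""
    codes = []
    # Steps 2-3: extend the current phrase while it exists in the dictionary;
    # otherwise output its code and register the extended phrase
    for char in data:
        candidate = phrase + char
        if candidate in dictionary:
            phrase = candidate
        else:
            codes.append(dictionary[phrase])
            dictionary[candidate] = next_code
            next_code += 1
            phrase = char
    # Step 4: output the code of the final phrase
    if phrase:
        codes.append(dictionary[phrase])
    return codes

def lzw_decompress(codes):
    # Rebuild the same dictionary while decoding
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    result = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # Special case: the code refers to the phrase currently being built
            entry = previous + previous[0]
        result.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(result)

codes = lzw_compress("TOBEORNOTTOBEORTOBEORNOT")
print(codes)
print(lzw_decompress(codes))  # -> TOBEORNOTTOBEORTOBEORNOT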

  • File Archiver:

A file archiver is a tool or format that combines multiple files or directories into a single archive file, usually compressing the data at the same time. Using a file archiver, file sizes can be reduced through compression, and multiple files can be stored and transferred efficiently.

One typical file archiver is the ZIP format, which compresses and archives data at the same time using the Deflate method, a combination of the Lempel-Ziv (LZ77) compression algorithm and Huffman coding. LZ77 detects duplicated data and replaces repeats with references to earlier occurrences, while Huffman coding assigns bit patterns to the resulting data.

Examples of other file archivers include RAR and 7-Zip. These formats also aim to compress data and use a combination of various algorithms (LZ77, LZMA, Bzip2, etc.) to achieve high compression ratios. These formats offer a variety of options for selecting compression algorithms and adjusting compression levels.

In addition to compressing files, file archivers offer a variety of functions, such as preserving file and directory structure, setting passwords, adding comments, error detection and repair, and splitting and merging. There is also a trade-off between compression efficiency and processing speed, depending on the algorithm and compression level chosen for the file archiver.
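As a simple illustration of working with the ZIP format, Python's standard zipfile module (mentioned again in the library list below) can create and extract archives; the file names here are hypothetical:

import zipfile

# Create an archive containing two (hypothetical) files, compressed with Deflate
with zipfile.ZipFile("archive.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("notes/readme.txt", "hello " * 1000)
    zf.writestr("data/values.csv", "1,2,3\n" * 1000)

# Extract everything again, preserving the directory structure
with zipfile.ZipFile("archive.zip") as zf:
    zf.extractall("restored")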

  • Fractal Compression:

Fractal compression is a method for efficiently compressing data, such as images and video, and it does so by taking advantage of a property of data called self-similarity.

In fractal compression, the data is divided into small portions (sub-blocks), each sub-block is compared with other parts of the original data to find similar regions, and sub-blocks with similarities are replaced recursively. This recursive replacement is what compresses the data.

The fractal compression procedure is as follows (a toy sketch of the first steps in Python appears after the list):

    1. Image segmentation: The original data (e.g., an image) is divided into small sub-blocks.
    2. Sub-block matching: Each sub-block is compared with other parts of the original data to find regions with high similarity.
    3. Substitution: Sub-blocks with high similarity are replaced by references to the matching parts of the original data. This substitution reduces the amount of information that must be stored.
    4. Recursive processing: The procedure is applied recursively; replaced sub-blocks are themselves divided into smaller sub-blocks and processed the same way.
    5. Compressed data storage: The compressed data (e.g., the sub-block mapping information) is stored.
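As a rough illustration of the first three steps only, here is a toy Python sketch using NumPy. The name fractal_encode is hypothetical, the contrast factor is fixed, and real codecs also search rotations and reflections, work recursively, and include a decoder:

import numpy as np

def fractal_encode(image, range_size=4):
    # Toy encoder for steps 1-3: each range block is described only by the
    # position of its best-matching domain block plus brightness parameters,
    # so raw pixel data is replaced by a few mapping parameters.
    domain_size = range_size * 2
    h, w = image.shape
    scale = 0.75  # fixed contrast factor, a common simplification
    # Steps 1-2: collect domain blocks, downsampled to range-block size
    domains = []
    for dy in range(0, h - domain_size + 1, domain_size):
        for dx in range(0, w - domain_size + 1, domain_size):
            block = image[dy:dy + domain_size, dx:dx + domain_size].astype(float)
            small = block.reshape(range_size, 2, range_size, 2).mean(axis=(1, 3))
            domains.append(((dy, dx), small))
    # Step 3: for each range block, keep only the best mapping
    mappings = []
    for ry in range(0, h, range_size):
        for rx in range(0, w, range_size):
            target = image[ry:ry + range_size, rx:rx + range_size].astype(float)
            best = None
            for pos, d in domains:
                offset = target.mean() - scale * d.mean()
                err = ((scale * d + offset - target) ** 2).sum()
                if best is None or err < best[0]:
                    best = (err, pos, offset)
            mappings.append(((ry, rx), best[1], scale, best[2]))
    return mappings  # the mapping parameters are what would be stored

image = (np.arange(16 * 16) % 256).reshape(16, 16)
print(fractal_encode(image)[:2])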

Fractal compression is effective for data with self-similarity, and may be especially applied to natural scenes and textures such as images and videos, as these often exhibit self-similarity.

Although fractal compression can achieve very efficient compression, the compression and decompression process is computationally expensive and may take a long time. Also, since fractal compression is not lossless compression, the original data cannot be completely recovered from the compressed data. Therefore, some information and details may be lost.

Fractal compression takes a different approach than common data compression algorithms (e.g., run-length compression, Huffman coding) and may be highly effective for certain data sets.

Libraries and platforms used for data compression

Various libraries and platforms are used for data compression. Some of the most common are listed below.

  • zlib: zlib is a general library for data compression and decompression, specifically implementing the Deflate compression algorithm. zlib is implemented in C and is used by many programming languages and applications.
  • gzip: gzip is a data compression utility commonly used on UNIX-like systems. It is based on zlib and is provided as a command-line tool for compressing and decompressing files.
  • ZIP Library: There are also a number of libraries that support data compression in ZIP format. For example, Java’s java.util.zip package and Python’s zipfile module.
  • Snappy: Snappy is a fast data compression library developed by Google that focuses on efficient use of CPU and memory and is often used in situations where speed of compression and decompression is required.
  • LZ4: LZ4 is a library that provides fast data compression algorithms. It is often used for real-time data processing and high-speed data transfer.
  • Libraries such as Snappy and LZ4 are also embedded in frameworks and platforms. For example, distributed data processing platforms such as Apache Hadoop and Apache Kafka use these compression libraries to compress and decompress data.
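As a quick illustration, zlib is available directly from Python's standard library; the round trip below shows that Deflate compression is lossless (the sample data is arbitrary):

import zlib

original = b"example data " * 100  # repetitive data compresses well

compressed = zlib.compress(original, 9)  # 9 = maximum compression level
restored = zlib.decompress(compressed)

print(len(original), "->", len(compressed))  # the compressed size is far smaller
assert restored == original  # lossless: the round trip is exact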
Data compression and blockchain technology

Data compression and blockchain technology have different purposes and applications, but several relationships and possible combinations exist.

  • Reducing block size through data compression: A blockchain is a distributed ledger in which multiple nodes hold copies of data. In a blockchain, the block size can affect network scalability and data transmission, and the use of data compression can reduce block size and improve network efficiency.
  • Application of compression algorithms: Data on the blockchain is protected using cryptographic techniques such as hash functions and digital signatures. Some compression algorithms use characteristics of the data to extract unique features and compress certain patterns of data, and these features can be used to protect and efficiently store data on the blockchain.
  • Merkle trees and compression: A Merkle tree, a data structure used in blockchains, is used to verify the integrity of transaction data. It repeatedly combines pieces of data with a hash function until a single root hash summarizes them all, compactly representing the data and reducing the amount of information that must be stored and exchanged for verification (see the sketch after this list).
  • Storage savings through data compression: Blockchains need to be stored for long periods of time to ensure data persistence. Data compression reduces storage usage and can be an effective way to store blockchains over time.
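To illustrate the Merkle tree idea, below is a minimal Python sketch (the merkle_root name is our own) that reduces a list of transactions to a single root hash, duplicating the last hash on odd-sized levels as Bitcoin does:

import hashlib

def merkle_root(leaves):
    # Hash each piece of data, then combine hashes pairwise, level by level,
    # until a single root hash summarizes the whole set
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

transactions = [b"tx1", b"tx2", b"tx3"]
print(merkle_root(transactions).hex())  # 32 bytes stand in for all transactions

As a concrete implementation example, the following Python code implements the Huffman coding described above, compressing a sample text and then decompressing it again.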
import heapq
from collections import Counter

class HuffmanNode:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):
        return self.freq < other.freq

def build_huffman_tree(text):
    # Calculate the frequency of occurrence of each character
    char_freq = Counter(text)
    # Create nodes based on frequency of occurrence and add them to a priority queue
    priority_queue = [HuffmanNode(char, freq) for char, freq in char_freq.items()]
    heapq.heapify(priority_queue)
    # Join nodes to build a Huffman tree
    while len(priority_queue) > 1:
        left_node = heapq.heappop(priority_queue)
        right_node = heapq.heappop(priority_queue)
        
        merged_node = HuffmanNode(None, left_node.freq + right_node.freq)
        merged_node.left = left_node
        merged_node.right = right_node
        
        heapq.heappush(priority_queue, merged_node)
    
    return priority_queue[0]

def build_huffman_codes(node, current_code="", huffman_codes=None):
    # Avoid a shared mutable default argument across separate calls
    if huffman_codes is None:
        huffman_codes = {}
    if node is None:
        return
    
    if node.char is not None:
        huffman_codes[node.char] = current_code
        return

    build_huffman_codes(node.left, current_code + "0", huffman_codes)
    build_huffman_codes(node.right, current_code + "1", huffman_codes)

def compress_text(text):
    huffman_tree = build_huffman_tree(text)
    huffman_codes = {}
    build_huffman_codes(huffman_tree, "", huffman_codes)
    
    compressed_text = "".join(huffman_codes[char] for char in text)
    return compressed_text, huffman_codes

def decompress_text(compressed_text, huffman_codes):
    reverse_huffman_codes = {code: char for char, code in huffman_codes.items()}
    current_code = ""
    decompressed_text = ""
    
    for bit in compressed_text:
        current_code += bit
        if current_code in reverse_huffman_codes:
            char = reverse_huffman_codes[current_code]
            decompressed_text += char
            current_code = ""
    
    return decompressed_text

# Data for testing
if __name__ == "__main__":
    original_text = "this is an example for huffman encoding"
    
    compressed_text, huffman_codes = compress_text(original_text)
    print("Compressed text:", compressed_text)
    
    decompressed_text = decompress_text(compressed_text, huffman_codes)
    print("Decompressed text:", decompressed_text)

The code compresses the given text using Huffman coding and then decompresses it to recover the original text. The execution result is as follows:

Compressed text: 0110111101101011100100110010001010111011111010011100110111111011110110111110101110011110010111010000110110010011000001001111010111010010010110000111011001111000111101000100100010010101011100010111110
Decompressed text: this is an example for huffman encoding
Reference Information and Reference Books

More information on data compression can be found in "Encryption and Security Techniques and Data Compression Techniques".

Reference books include "Understanding Compression: Data Compression for Modern Developers", "Handbook of Data Compression", and "Introduction to Data Compression".
