Run Length Encoding is a lossless data compression algorithm. It compresses data by reducing repetitive, and consecutive data called runs. It does so by storing the number of these runs followed by the data.
In this article, we will learn more about Compression algorithms, dive deep into implementing RLE algorithm and understand its performance.
Table of contents
- Types of compression algorithms
- Time Complexity
- Further reading
To follow along with this tutorial, the reader should have:
- A basic understanding of Python.
- Familiarity with a notebook-based interface like Google Colab. Normal Python files could be used as well.
Before we understand RLE, let’s have a look at few examples:
- For the text
AAAAAAAAAAAAAHHHEEM(19 characters), RLE will encode it to
- For the text
AAAAHHHEEM, HAHA., it will be encoded as
4A3H2E1M1,1 1H1A1H1A1.(21 characters).
From these examples, we see that RLE is suitable for compressing large amounts of data with a few runs e.g., image pixel information. This is because short runs may end up taking the same space or more as we see in the second example.
This is a process where a file size is reduced using algorithms resulting in a file that uses fewer storage bits than the original file.
- In a situation where one wants to send a picture to his/her friend, compression will be done at the source device and decompressed at the destination device. This enables the file to be sent faster, and it reduces overhead traffic for transmitting the data.
- In social media apps like WhatsApp, we notice that the image received is of lower quality and consumes much lesser space. For a few apps, we can set the quality of media to be downloaded.
- We also notice that compression in video streaming sites, where videos are loaded in low quality during poor internet connectivity.
Types of compression algorithms
There two main classification types for compression algorithms are:
1. Lossless algorithm
Lossless algorithms are used when information quality is very important. We try to avoid the loss of image quality.
These processes are reversible, and they’ve very low compression ratios since we don’t lose any information.
Compression ratio - We get this ratio by dividing the size before compression and size after compression. (
Size before compression/Size after compression).
You can read more about Huffman coding here.
2. Lossy algorithm
We use lossy algorithms where quality could be compromised. Here, we achieve high compression ratios, hence greater size reduction.
The lossy compression process is non-reversible, unlike the one for lossless compression algorithms.
Let’s jump into the Python code.
Create a new notebook and add the following code to the first cell.
def encode_message(message): encoded_string = "" i = 0 while (i <= len(message)-1): count = 1 ch = message[i] j = i while (j < len(message)-1): '''if the character at the current index is the same as the character at the next index. If the characters are the same, the count is incremented to 1''' if (message[j] == message[j + 1]): count = count + 1 j = j + 1 else: break '''the count and the character is concatenated to the encoded string''' encoded_string = encoded_string + str(count) + ch i = j + 1 return encoded_string
The above snippet contains a function
encode_message() with a parameter
message (string) to be encoded.
encoded_string is used for storing the encoded string. The
while loop is initialized with the count of
1 (one-indexed), and looped through the whole message
while loop checks if the character at the current index is the same as the character at the next index.
If the characters are the same, the count is incremented to
1. If not, the count and the character are concatenated to the
encoded_string variable and returned.
Next, we have to write a function for decoding the encoded messages. Create a new cell and paste this code:
def decode(our_message): decoded_message = "" i = 0 j = 0 # splitting the encoded message into respective counts while (i <= len(our_message) - 1): run_count = int(our_message[i]) run_word = our_message[i + 1] # displaying the character multiple times specified by the count for j in range(run_count): # concatenated with the decoded message decoded_message = decoded_message+run_word j = j + 1 i = i + 2 return decoded_message
decode() takes in the parameter
our_message containing the encoded message. The variable
decoded_message stores the decoded string.
while loop is used for separating out the encoded message into respective counts
run_count, and the characters
run_word, until it is finished.
It contains a
for loop for displaying the character multiple times specified by the count (
run_count) to form a string. The string is then concatenated with the
decoded_message variable. This continues until the decoded string is displayed in full.
Finally, we write the method for displaying what we have written. Again, open a new cell and add the following:
def display(): # the original string our_message = "AuuBBBCCCCCCcccccCCCCCCCCCA" # pass in the original string encoded_message=encode_message(our_message) # pass in the decoded string decoded_message=decode_message(encoded_message) print("Original string: [" + our_message + "]") print("Encoded string: [" + encoded_message +"]") print("Decoded string: [" + decoded_message +"]")
display(), has the string to be encoded and decoded.
We then call the two methods for the process. We pass in the original string to the
encode_message() function. After encoding, it is stored in the
The encoded message,
encoded_message, is passed to the
decode_message function for decoding. Thereafter, the decoded message is stored in the
The output will be as shown below:
Original string: [AuuBBBCCCCCCcccccCCCCCCCCCA] Encoded string: [1A2u3B6C5c9C1A] Decoded string: [AuuBBBCCCCCCcccccCCCCCCCCCA]
The Colab notebook for this code is found here.
The algorithm has
O(n) complexity compared to other lossless algorithms like Huffman with a complexity of
O(nlogn), which is computationally more expensive than RLE.
RLE performs poorly on large amounts of data. This is clearly explained in this research paper.
You can learn about calculating time and space complexities here.
To conclude, we have learned what compression is, the types of compression, the Run Length Encoding algorithm, and its implementation in Python.
Peer Review Contributions by: Srishilesh P S