Honey, I shrunk the data: an introduction to data compression

We all know the dilemma. We take out our suitcase and pack for an upcoming trip. The suitcase looks huge when we start packing, but soon it fills up and looks crammed. There are still a few things we would like to carry, but there seems to be no space left. We can now do one of two things: pack the existing items more compactly so that they take up less space and free room for new additions, or take out some items we can do without. We face a similar dilemma in computing. We run low on disk space, or we are stuck with low bandwidth and have huge files to transfer. How do we solve the problem? The answer is data compression.

The principle behind data compression

We all like to fit more value into less. The idea of data compression is exactly that. It solves the challenge of fitting more data into fewer bytes, so that we get more bang for the buck in terms of storage and bandwidth costs. The key principle behind data compression is ‘remove whatever is unnecessary and compact the data as much as possible’. Based on what is deemed unnecessary, there are two basic approaches to data compression: lossy and lossless.

The names mean exactly what they say. The lossless method compresses data without losing any of it in the process. This can be likened to folding clothes more compactly in a suitcase, so that unnecessarily occupied space is freed up. The lossy method, on the other hand, deliberately discards data that is not useful to the other end, or that can be inferred or reconstructed despite the missing chunks. For example, if we are travelling to a well-equipped city, we can discard the shaving kit from our suitcase and simply avail of a barber’s services at the destination.

Let us look at how each compression method works, in practice and through analogies.

Lossless compression

As mentioned in the previous section, lossless compression methods do not discard any data; instead, they find ways to compact it by removing unnecessarily occupied space.

Compression by re-evaluating space

The first of the two methods is to evaluate whether a piece of data can take up less space than what standard data storage practices allot to it. That sentence was quite a mouthful, so let us simplify it.

When we buy a standard notebook, we typically pick one off the shelf in one of the standard sizes, e.g. Letter, A4 or A5. Let us say the notebook has 100 pages, each exactly the same size. In my case, each page has 23 ruled lines. But have you ever thought about how much of that you actually write on? Out of 23 ruled lines, we probably use about 13 or 14. We leave blank lines for formatting, and if it happens to be the last page of our content, the rest of the page is left unused. Look at the image below, where I have used just three and a half lines out of 23 on a standard notebook page. Nearly 20 lines serve no purpose for this content.

001-page-uncompressed

Now let us say that we have set ourselves a challenge: to fit all the writing from our notebook into something that fits in a pocket. Technically it is possible, but we need to work at it. For example, we can use a smaller pocket notebook and write all our content end to end, with minimal line spacing and no page breaks between different pieces of content. Alternatively, we can cut the written bits out of our original notebook, clip them onto a large foldable sheet of paper and keep the sheet folded in our pocket. Either way, we have compressed the content of our notebook into a pocket-fit form factor.

002-page-compressed

The important point is that the notebook maker offers a one-size-fits-all product, but we can plan our way to carefully compress our own content.

Computers use data compression in a similar fashion. Typically, each letter that we type occupies a whopping 2 bytes of memory! Why? A computer has no way of knowing which characters you are going to type, or even which language you are going to use. To accommodate any character from any language supported in computing today, a computer must make room for more than 60,000 different characters, which means each character needs to be 16 bits, or 2 bytes, long. This is like the notebook maker who produces a generic A4-sized notebook of 100 pages.

However, let us say that we have typed a standard English document using lowercase and uppercase letters, numbers and special characters such as brackets, hyphens and spaces. The number of possible characters now drops to around 60, which means each character can be represented in just 6 bits, less than a byte! When a lossless data compression algorithm goes through the document, it notices exactly this and starts hacking away at the unnecessary bits in every character. Each character is reduced from 16 bits to just 6, a reduction to 3/8 of the original size, i.e. what used to occupy 8 storage units now occupies only 3.
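To make this concrete, here is a small Python sketch (an illustration of the idea, not any real compressor): it gives each character in a 64-symbol alphabet a 6-bit code and packs the codes tightly, so the same text takes far fewer bytes than at 16 bits per character. The alphabet and function names are made up for this example.

```python
# Illustration only: re-encode text at 6 bits per character.
# The alphabet below (letters, digits, space, full stop) has exactly
# 64 symbols, so 6 bits per character are enough to cover it.
import string

ALPHABET = string.ascii_letters + string.digits + " ."
CODE = {ch: i for i, ch in enumerate(ALPHABET)}   # char -> 6-bit code

def pack_6bit(text):
    """Pack the text at 6 bits per character and return it as bytes."""
    bits = "".join(format(CODE[ch], "06b") for ch in text)
    bits = bits.ljust((len(bits) + 7) // 8 * 8, "0")   # pad to whole bytes
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

text = "Data compression fits more into less."
print(len(text) * 2, "bytes at 16 bits per character")        # 74
print(len(pack_6bit(text)), "bytes at 6 bits per character")  # 28
```

Decoding simply reverses the process: read the packed data 6 bits at a time and look each code up in the same table.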

Compression by clubbing repeated data

Take the following phone number: 9813322257
Here is one way to say it: Nine Eight One Three Three Two Two Two Five Seven
And here’s another: Nine Eight One (Double Three) (Triple Two) Five Seven
Say both aloud and you will find that the second way takes far less time.

And picture this too: five people are eating at a restaurant, and three of them have ordered coffee. We would never say, “coffee for him, coffee for her, coffee for her”; it would always be “three coffees”. Clubbing repeated information into “n times this” or “n of this” is a very powerful way to compress things down.

Data compression algorithms are excellent at clubbing repeated data. While plain text rarely contains long runs of identical data, sound and image data exhibit such repetition in big chunks and benefit hugely from this type of compression. We will look at detailed examples later, in the case study section.
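As a rough illustration of the principle (a simplified sketch, not how any particular file format implements it), here is a run-length encoder in Python that collapses runs of repeated values into (value, count) pairs, exactly like saying “triple two”:

```python
# Minimal run-length encoding: collapse runs into (value, count) pairs.
def rle_encode(data):
    """Collapse consecutive repeated values into (value, count) pairs."""
    runs = []
    for value in data:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [(v, c) for v, c in runs]

def rle_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [v for v, c in runs for _ in range(c)]

digits = "9813322257"
print(rle_encode(digits))
# [('9', 1), ('8', 1), ('1', 1), ('3', 2), ('2', 3), ('5', 1), ('7', 1)]
assert "".join(rle_decode(rle_encode(digits))) == digits
```

Notice that the encoded form is hardly shorter here, because the phone number has few repeats; the trick pays off on data with long runs, such as rows of identical pixels.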

Lossy compression

Lossy compression relies on the fact that most data also contains noise or irrelevant pieces that can be discarded while the remaining data stays completely usable and makes sense. Lossy compression, however, must be extremely smart about what it discards, and is typically harder to program. Instead of pure maths, lossy algorithms leverage things like human behaviour, the senses and context to get rid of irrelevant data.

Let us see a very simple example.
Question: “What colour is the shirt you are wearing?”
Answer 1: “The colour of the shirt that I am wearing is red.”
Answer 2: “Red”

As you can see, answer 2 is a single word, but it conveys the meaning of answer 1 completely. Here, context is used to compress our answer, since the continuity of the conversation is implied. However, this may not always work, as in:

Person 1 over phone: “What are you wearing to the party today?”
Answer 1: “I am wearing my red evening gown.”
Answer 2: “Red”

Here it makes no sense to compress the answer to one word, since the term ‘red’ does not convey the complete message. The colour is conveyed, but not the type of dress.

Lossy data compression works on a similar principle. It evaluates whether certain pieces of data can be discarded while the remaining pieces still retain their utility.

Data compression case study

Two of the best places to see lossy and lossless data compression used together are JPEG images and MP3 audio.

In JPEG images, continuous blocks with similar colours are converted to exactly the same shade. These changes are generally imperceptible to the human eye. For example, a camera lens often captures many different shades of blue in the sky, but the human eye is unable to spot most of the differences, except perhaps near the horizon. JPEG uses this human limitation to flatten out the colours.

This is the lossy part. Once multiple colours are flattened into one, they cannot be turned back into multiple colours. Why? Think of a simple example in mathematics. The average of 3 and 5 is 4. But we cannot get the individual numbers back when given only the average 4: it could just as well be the average of 1 and 7, or of 0, 1, 6 and 9. We cannot even tell how many numbers were averaged, let alone what they were.
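A toy sketch of this flattening (a deliberate simplification; real JPEG quantises frequency coefficients rather than raw pixel values) is to round nearby channel values to a common value. Once rounded, the originals are gone for good, just like the numbers behind an average:

```python
# Toy "flattening": round each 0-255 channel value to the nearest
# multiple of 32, so nearby shades collapse into the same shade.
# The original values cannot be recovered afterwards.
def flatten(pixels, step=32):
    return [round(p / step) * step for p in pixels]

sky = [200, 197, 203, 199, 201, 204]   # slightly different shades of blue
print(flatten(sky))                    # [192, 192, 192, 192, 192, 192]
```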

Once this is done, the lossless repetition-reduction method described earlier is applied. For example: draw a 1024 x 50 pixel block of this sky-blue shade starting from the top of the image. That is roughly 150,000 bytes of individual pixel data reduced to 15 bytes: 3 bytes to store the colour, 4 bytes for how many times the colour is to be repeated, 4 bytes for where the shading starts and 4 for where it stops. That is a reduction by a factor of about 10,000!
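The back-of-the-envelope arithmetic behind that factor, assuming 3 bytes (RGB) per pixel and the 15-byte record described above, works out like this:

```python
# Rough arithmetic for the sky-blue block described above,
# assuming 3 bytes (RGB) per pixel and a 15-byte colour-plus-extent record.
width, height = 1024, 50
raw_bytes = width * height * 3        # 153,600 bytes of individual pixels
record_bytes = 3 + 4 + 4 + 4          # colour + repeat count + start + stop
print(raw_bytes, record_bytes, raw_bytes // record_bytes)   # 153600 15 10240
```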

In MP3, sounds captured by the microphone but not audible to human ears are removed first. Notes too similar for the human ear to distinguish are averaged into the same note. That is the lossy part. After this, repeating patterns in the sound are sought, such as a guitar riff that repeats throughout the song or the drum rhythm that forms the bass line. A pattern can be extracted once and marked with where and how often it repeats in the track: this drum rhythm plays repeatedly from the 38th second to the 140th second, or this keyboard sequence starts at the 41st second and repeats 30 times.

After the lossy and lossless passes, an incredible amount of compression is attained, yet the image or the sound still makes perfect sense to the person consuming the content.

Conclusion

As we can see, with computers handling ever more types of content, and with that content growing bigger and heavier, it is absolutely vital to have a solid data compression plan in place, so that content can be stored in as little space as possible and transferred across networks with as little cost and delay as possible.

How have you used data compression to reduce time, space, bandwidth and costs in your projects? What kind of compression do you use, and how much does it save you? How much infrastructure does it let you cut, and what does it make possible that you could not have imagined before?
