Data Compression : Do we need it?

Ani
Nov 12, 2021

If you look at the history of technology over a couple hundred years, it’s all about time compression and making the globe smaller. It’s had positive effects, all the ones that we know. So we’re much less likely to have the kind of terrible misunderstandings that led to World War I, for example. — Eric Schmidt


Since the 1970s, computer scientists have used mathematical algorithms to find ways to reduce file size. Since then, there has been an ever-growing demand, propelled by the development of the Internet, to create better compression mechanisms and reduce the size of any given file as much as possible.

Data compression is the process of encoding, restructuring or otherwise modifying data in order to reduce its size. Fundamentally, it involves re-encoding information using fewer bits than the original representation.

Compression is performed by an algorithm that determines how to shrink the size of the data. A good example of this often occurs with image compression. When a sequence of colors, like ‘blue, red, red, blue’ is found repeatedly throughout the image, the algorithm can replace each occurrence with a much shorter code, while still preserving the underlying information.
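To make the idea concrete, here is a minimal sketch of that kind of substitution. It is a toy, not how any real image format works: the token name `P` and the helper functions `encode` and `decode` are made up for illustration, and the sketch assumes `P` never appears as a real color in the input.

```python
# Toy substitution: replace a repeated color pattern with one short symbol.
# PATTERN and TOKEN are illustrative assumptions, not part of any real format.
PATTERN = ("blue", "red", "red", "blue")
TOKEN = "P"  # hypothetical short code standing in for PATTERN


def encode(pixels):
    """Replace every occurrence of PATTERN with TOKEN."""
    out = []
    i = 0
    n = len(PATTERN)
    while i < len(pixels):
        if tuple(pixels[i:i + n]) == PATTERN:
            out.append(TOKEN)
            i += n
        else:
            out.append(pixels[i])
            i += 1
    return out


def decode(symbols):
    """Expand TOKEN back into PATTERN, restoring the original sequence."""
    out = []
    for s in symbols:
        if s == TOKEN:
            out.extend(PATTERN)
        else:
            out.append(s)
    return out


pixels = ["blue", "red", "red", "blue", "green",
          "blue", "red", "red", "blue"]
encoded = encode(pixels)
print(len(pixels), "symbols ->", len(encoded), "symbols")
```

Decoding the result gives back exactly the original list, which is the whole point: fewer symbols, same information.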


For data transmission, compression can be performed on the data content or on the entire transmission unit, including header data. When information is sent or received via the internet, larger files, either singly or bundled with others in an archive file, may be transmitted in ZIP, GZIP or another compressed format.
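As a rough illustration of the archive case, the sketch below packs two small files into an in-memory ZIP archive using Python's standard `zipfile` module. The file names and contents are invented for the example; a real transfer would write the archive to disk or a socket instead of a `BytesIO` buffer.

```python
import io
import zipfile

# Hypothetical files to send; names and contents are made up for the example.
files = {
    "report.txt": b"quarterly numbers\n" * 500,
    "notes.txt": b"meeting notes\n" * 500,
}

# Build a ZIP archive in memory, compressing each member with DEFLATE.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name, data in files.items():
        archive.writestr(name, data)

archive_bytes = buffer.getvalue()
original_size = sum(len(data) for data in files.values())
print(original_size, "bytes ->", len(archive_bytes), "bytes")

# The receiver opens the archive and reads every member back.
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    restored = {name: archive.read(name) for name in archive.namelist()}
```

The sample data is deliberately repetitive, so the archive comes out far smaller than the inputs; real files will usually compress less dramatically.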

Lossless & Lossy : What is this game?

Compressing data can be a lossless or lossy process. Lossless compression implies that a file can be restored without the loss of a single bit of data when it is uncompressed. Lossless compression is the typical approach with executables, as well as text and spreadsheets, where the loss of words or numbers can change the data.
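A quick way to see what "without the loss of a single bit" means is a round trip through Python's standard `zlib` module. The spreadsheet-like sample data here is invented for the example.

```python
import zlib

# Made-up CSV content standing in for a spreadsheet.
rows = "\n".join(f"{i},{i * 10}" for i in range(1000))
text = ("id,amount\n" + rows).encode("utf-8")

packed = zlib.compress(text)       # lossless compression
unpacked = zlib.decompress(packed)  # restore the original bytes

print(len(text), "bytes ->", len(packed), "bytes")
print("identical after round trip:", unpacked == text)
```

Every byte comes back unchanged, which is why this family of methods is safe for data where a single altered digit would matter.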

Lossy compression permanently eliminates bits of data that are redundant or unimportant. Lossy compression is useful with images, audio and video, where the removal of some data bits has little or no discernible effect on the representation of the content.
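One simple lossy technique is quantization: store each value only to the nearest multiple of some step size. The sketch below is a bare-bones illustration with made-up sample values and a made-up step of 8; real audio and image codecs are far more elaborate.

```python
def quantize(samples, step):
    """Lossy step: keep only the nearest multiple of `step` for each sample."""
    return [round(s / step) for s in samples]


def dequantize(levels, step):
    """Rebuild approximate samples from the stored levels."""
    return [q * step for q in levels]


# Hypothetical sample values (for instance, audio amplitudes).
samples = [0, 3, 7, 12, 18, 25, 31, 40]
step = 8  # assumed step size; larger steps save more but lose more

levels = quantize(samples, step)
approx = dequantize(levels, step)
errors = [abs(a - s) for a, s in zip(approx, samples)]

print("stored levels:", levels)
print("reconstructed:", approx)
print("largest error:", max(errors))
```

The stored levels are smaller numbers that need fewer bits, but the original samples cannot be recovered exactly: the difference is gone for good, which is what makes the method lossy.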


Image compression can be lossy or lossless. Image file formats are typically designed to compress information since the files tend to be large. JPEG is an image file format that supports lossy image compression. Formats such as GIF and PNG use lossless compression.

Importance of Data Compression

The main advantages of compression are reductions in storage, data transmission time, and communication bandwidth. This helps organisations achieve significant cost savings. Compressed files use significantly less storage space than uncompressed files, which means lower expenses for storage services. A compressed file also takes less time to transfer and consumes less bandwidth. This lowers costs further and increases productivity.

The main disadvantage of data compression is the increased use of computing resources needed to compress and decompress the data. As a consequence, compression vendors give priority to speed and resource optimizations in order to minimize the impact of intensive compression jobs.

Trivia

[Figure: An example of LZ77 encoding]

Published in 1977, LZ77 is the algorithm that laid the groundwork for many later dictionary-based compressors. It introduced the concept of a ‘sliding window’, which brought about significant improvements in compression ratio over more primitive algorithms. LZ77 encodes data as triples representing offset, run length, and a deviating character. The offset is how far back from the current position a matching phrase starts, and the run length is how many characters from that point are part of the phrase. The deviating character is the first character after the match; it indicates that a new phrase was found, and that phrase is equal to the matched phrase plus the deviating character. The dictionary changes dynamically as the sliding window moves through the file. For example, the sliding window could be 64MB, which means that the dictionary will contain entries for the past 64MB of the input data.
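The triple scheme can be sketched in a few lines of Python. This is a simplified teaching version, not the original 1977 formulation or any production codec: the function names `lz77_encode` and `lz77_decode` are made up, the window is a deliberately tiny 16 characters, and the offset is measured backwards from the current position, as is usual for sliding-window schemes.

```python
def lz77_encode(data, window=16):
    """Encode `data` as (offset, length, next_char) triples.

    `offset` counts backwards from the current position; `window` is an
    assumed, deliberately small sliding-window size.
    """
    i = 0
    triples = []
    while i < len(data):
        best_offset, best_length = 0, 0
        start = max(0, i - window)
        # Try every starting point inside the window and keep the longest match.
        for j in range(start, i):
            length = 0
            # Stop one short of the end so a deviating character always remains.
            while (i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = i - j, length
        next_char = data[i + best_length]
        triples.append((best_offset, best_length, next_char))
        i += best_length + 1
    return triples


def lz77_decode(triples):
    """Rebuild the text by copying earlier output, then adding next_char."""
    out = []
    for offset, length, next_char in triples:
        start = len(out) - offset
        for k in range(length):
            # Copy one character at a time so overlapping matches work.
            out.append(out[start + k])
        out.append(next_char)
    return "".join(out)


text = "abracadabra abracadabra"
triples = lz77_encode(text)
print(len(text), "characters ->", len(triples), "triples")
print(lz77_decode(triples) == text)
```

The brute-force search over the window keeps the sketch short; practical implementations use hash tables or similar structures to find matches quickly, and they pack the triples into bits rather than Python tuples.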

Stay tuned for more about compression and its varieties in the big data world. For help with career counselling, resume building, or design discussions, or to know more about the latest data engineering trends and technologies, reach out to me at anigos.

P.S : I don’t charge money


Ani

Big Data Architect — Passionate about designing robust distributed systems