An introduction to compression.

27 July 2022

If I present you with a text file with some simple sentence and asked you to make it take up less space, without changing the content… You’d likely tell me that was an impossible task. Yet, that’s exactly what lossless compression does.

So How does compression work

There are two types of compression, lossless and lossy. Lossy compression works by removing minimal information to shrink a file. This is the equivilent of using conjunctives in my text file:

Hi, how are you?

becomes

Hi, how're you?

Far more interesting is lossless compression, which is able to reduce the size of a file without losing a single bit of information.

Lossless compression

So, how can this possibly work? Well, lossless compression attacks flaws in the file format, rather than the information in the file. For instance, Our text file only uses a few characters. For instance,

Hi, how are you?

contains 13 unique characters, which can be stored in 4 bits, half the size of a typical ASCII character. This means that we can reduce the size of the text by half without losing any information.

The classic example is a lossless PNG file, which is able to reduce the size of an image by up to 847 times.

A simple implementation of image compression

Perhaps the simplest way to reduce the size of an image file is to collapse duplicate pixels. For instance, many images are set on a black background. By creating a format that allows us to remove duplicate pixels we can hugely reduce file size.


#define MAGIC 'SUKC'

typedef struct {
	uint32_t magic; // 'SUKC'.. lol
	uint32_t width; // image width
	uint32_t height; // image height
} synackuk_compression_header_t;

typedef struct {
	uint8_t r; // red pixel value
	uint8_t g; // blue pixel value
	uint8_t b; // green pixel value
	uint8_t repeat; // the number of times this pixel is repeated
} pixel_t;

int compare_pixel(pixel_t* p1, pixel_t* p2) {
	return p1->r == p2->r && p1->g == p2->g && p1->b == p2->b;
}

void compress_data(uint32_t* image_in, uint32_t image_width, uint32_t image_height, char** image_out, size_t* image_size) {

	/* First build an image header */
	synackuk_compression_header_t header = (synackuk_compression_header_t){.magic = MAGIC, .width=image_width, .height=image_height};

	/* Then we build our output buffer */
	char* buf = malloc(image_width * image_height + sizeof(header));

	/* Copy the header to the start of our buffer */
	memcpy(buf, &header, sizeof(header));


	uint32_t* image_start = (uint32_t*)(buf + sizeof(header));

	uint32_t total_repeats = 0;
	uint32_t output_ptr = 0;

	for(uint32_t i = 0; i < image_width * image_height; i += 1) {
		pixel_t* pixel = (pixel_t*)&image_in[i];
		/* Either we use 0xff, but if that would exceed the image boundaries we use whatever the end of the buffer is */
		uint8_t max_repeat = ((image_width * image_height) - i) > 0xff ? 0xff : ((image_width * image_height) - i);

		uint8_t repeat = 0;
		for(uint8_t j = 1; j < max_repeat; j += 1) {
			pixel_t* next_pixel = (pixel_t*)&image_in[i + j];
			if(compare_pixel(pixel, next_pixel)) {
				repeat += 1;
			}
			else {
				break;
			}
		}
		i += repeat;
		total_repeats += repeat;

		pixel_t* output_pix = (pixel_t*)&image_start[output_ptr];
		memcpy(output_pix, pixel, sizeof(pixel_t));

		output_pix->repeat = repeat + 1;
		output_ptr += 1;
	}

	size_t len = image_width * image_height + sizeof(header) - total_repeats;
	*image_out = buf;
	*image_size = len;
}

This will compress an image by taking the duplicate pixels and collappsing them into a repeat field. To decompress it all we need to do is multiply the pixels by the repeat field like so:


void decompress_data(uint32_t* image_in, char** image_out, size_t* image_size) {
	synackuk_compression_header_t* header = (synackuk_compression_header_t*)image_in;

	if(header->magic != MAGIC) {
		return;
	}

	size_t outbuf_len = header->width * header->height;
	

	char* outbuf = malloc(outbuf_len);

	char* start = ((char*)image_in) + sizeof(synackuk_compression_header_t);
	outbuf_len -= sizeof(synackuk_compression_header_t);

	uint32_t place_in_out = 0;
	for(int i = 0; i < outbuf_len; i += 4) {
		pixel_t* pixel = (pixel_t*)&start[i];
		for(int i = 0; i < pixel->repeat; i += 1) {
			pixel_t* outpixel = (pixel_t*)(outbuf + place_in_out);
			memcpy(outpixel, pixel, sizeof(pixel_t));
			outpixel->repeat = 0xFF;
			place_in_out += 4;
		}
	}
	*image_out = outbuf;
	*image_size = outbuf_len;
}

This was able to reduce a test image I had by 5 times which is pretty good for such a simple compression method.

Conclusion

Though compression is hugely complex, at its most basic level it’s quite trivial. I highly reccomend people have a go at playing around with these ideas, you can learn a great deal about file formats and how basic elements like ASCII are actually designed. The compression algorithm that I wrote for the article is available here