
Friday, 15 February 2019

Iteratively Sequenced Data Compression

I first came across data compression back in the 1980s, some 5 or 6 years after first being awestruck by this thing called a computer.

At the time I was very young, and these computer things were new - and, by all accounts, a pretty big deal. I was fortunate enough to have one at home that I shared with my brother. It was a BBC Micro Model-B from Acorn Computers. Go Acorn!

It was awesome.

This thing was competing with a trusty BMX, He-Man, haunted house exploring, and lots of shenanigans. So it really did have to be awesome to get a look in.

It had 32K of memory. Thirty-two freaking K. An unimaginably large number of transistors that you couldn't dream of filling! Right? Well, no. It turns out that you run out of 32K pretty fast... even back in the day.

Back then I was writing 6502 assembly language, dabbling in the 8-bit games industry, and designing graphics and maps on graph paper with a pencil (until I wrote a frickin' nifty sprite editor), because dammit, I was a games programmer! High school and night classes at college were just a necessary pastime. Game programming. That was going to be my life.

And then in 1988 Superior Software brought out an amazing new game called Exile. It was literally a game changer - no pun intended. It was a great game, and many hours and fond memories were spent with buddies trying to get to the end of it. I believe it was the first game, or one of the first, to sport an early physics engine... gravity and inertia at least. It was amazing. The graphics were, by 8-bit, 8-colour standards, intricate and plentiful. The map was huge. It broke so many rules of what was possible.

All of that was, of course, fascinating and made for a great game.

Only I like to learn and tinker. So tinker and learn I did.

One of the claims that this game made was around the size of the graphics, the size of the sound, and the map. The numbers were staggering - significantly more than 32K!

Whoa!?

How does that work, I thought. It won't fit.


I pondered.

I reached out to friends.

We pondered together.

It did not make sense.

Back then, there was no Internet. There wasn't an endless supply of resources, or of people doing this stuff to reach out to, especially not outside of universities.

So I took to doing what I would always do in these scenarios. Take it apart. Take a look. Figure out what the hell was going on.

I was pretty resourceful when it came to getting into computers and code, and by this time I was a dab hand at reading machine code; not just assembly language, I mean actual hexadecimal machine code, understanding exactly what it was up to - often printed out on reams of dot matrix printer paper that I would read travelling about in the back of my parents' awesome Rover 2600. Pretty geeky. To be fair, 6502 machine code was a hell of a lot simpler than today's machine code, and you had to know it back then to count clock cycles and see where you could save the odd millisecond.

Anyway, the point is, Superior Software got all of this content that was bigger than the thing it was going into by using compression. I genuinely agonised for some time over how a bit could get smaller than a one or a zero, came to the conclusion it couldn't, went up to the byte level, and was still stuck. So then I did some tinkering. You may choose to call it hacking. Whatevs. That, along with talking it through with a buddy, led to the realisation of what I later learned is called RLE: run-length encoding.

It's a very basic compression technique whereby you don't store the data itself, rather, you store the pattern of data.

Mindblown.

You don't store the data.

Mind. Blown.

These days it's unbelievably obvious of course and a very basic algo.

Data is quite often a series of the same value, particularly in graphics, maps, and sound. So, for example, instead of storing 28 zeros for a strip of black in your image, you store the number 28, followed by the number 0. Then your code decompresses it by reading the 28, 0, and expands it back out to 28 occurrences of 0. The thing is, you only stored two bytes (the 28 and the 0) instead of 28 bytes (28 consecutive 0s). So if you do that across your entire set of graphics, sound, and map data, you get something big into a smaller space.

Easy.

Or rather, easy when you know how.
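If you fancy playing with it, here's a minimal sketch of the idea in Python - my own illustration, obviously not the 6502 code from back then:

    def rle_compress(data: bytes) -> bytes:
        """Collapse runs of identical bytes into (count, value) pairs."""
        out = bytearray()
        i = 0
        while i < len(data):
            value = data[i]
            run = 1
            # A count has to fit in one byte, so cap runs at 255.
            while i + run < len(data) and data[i + run] == value and run < 255:
                run += 1
            out += bytes([run, value])
            i += run
        return bytes(out)

    def rle_decompress(packed: bytes) -> bytes:
        """Expand (count, value) pairs back into the original bytes."""
        out = bytearray()
        for count, value in zip(packed[0::2], packed[1::2]):
            out += bytes([value]) * count
        return bytes(out)

    # The strip of black from above: 28 zeros shrink to two bytes, 28 and 0.
    strip = bytes(28)
    packed = rle_compress(strip)            # b'\x1c\x00'
    assert rle_decompress(packed) == strip

Real-world RLE variants get a bit cleverer than this (escape codes so that runs of one don't double in size, and so on), but the idea is the same.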

So... that's a very long intro to a possible new compression technique that, I hasten to add, would make a pretty neat TED talk intro. The important part to take away is this...

You don't store the data.

So I've concluded that if we don't care about the data, and we're never truly going to store the data, and we want lossless compression, well then, let's not even go down that path. Let's not dream up ever cleverer algorithms that look at the data in different ways - scanning it as cubes rather than strips; trading quality for size by deciding on a loss factor and asking whether a value is "close enough"; or breaking graphical data into meaningful slices, e.g. filter out the red component of the RGB values, compress it, then do green, and so on and so forth... there are many techniques.

Here's the thing...

All we need is an established public library of bit sequences.

I've long held the belief that every image, and indeed every sound, can be generated through tedious incremental change; therefore all the patterns of data already exist.

To that point: all data that will exist already exists.

Don't dwell on that, just run with it and maybe I will write about that some other time.

To record massive data sets as single sequences would be borderline silly because of the indexing time required. So we break the full data set into sensible, optimised chunks. Then we simply index them against our established public sequence library. We do not store the sequence library itself, because it consists of established public sequences, and remember: we do not store the data.

Then you repeat the process on your newly generated sequence, because there will be a public library sequence for that too. You keep repeating until you are left with an incredibly tiny amount of metadata: all you need to store is the iteration depth and the sequence numbers, starting with the seed sequence. Decompression would be the reverse process.
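To make that a little more concrete, here's a toy sketch of a single pass in Python. Everything in it is hypothetical - CHUNK_SIZE, build_toy_library, and the tiny library itself are just stand-ins for the established public sequence library the scheme assumes both ends already share:

    CHUNK_SIZE = 4  # bytes per chunk; a real library would offer several lengths

    def build_toy_library(data: bytes) -> dict[bytes, int]:
        """Stand-in for the public library: chunk pattern -> sequence number.
        Here we only enumerate the chunks this demo happens to need."""
        chunks = sorted({data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)})
        return {chunk: n for n, chunk in enumerate(chunks)}

    def compress_once(data: bytes, library: dict[bytes, int]) -> list[int]:
        """One pass: replace each chunk with its index in the shared library."""
        return [library[data[i:i + CHUNK_SIZE]]
                for i in range(0, len(data), CHUNK_SIZE)]

    def decompress_once(indices: list[int], library: dict[bytes, int]) -> bytes:
        """Reverse lookup: sequence numbers back to the chunks they named."""
        reverse = {n: chunk for chunk, n in library.items()}
        return b"".join(reverse[n] for n in indices)

    data = b"ABCDABCDEFGHABCD"               # 16 bytes with repeated chunks
    library = build_toy_library(data)        # stands in for the public library
    indices = compress_once(data, library)   # [0, 0, 1, 0]
    assert decompress_once(indices, library) == data

    # In the full scheme the index list is itself data, so you would re-chunk
    # it and repeat, keeping only the iteration depth and the final (seed)
    # sequence numbers as metadata.

In this toy the library is built from the data itself, which is of course cheating; the whole point of the scheme is that the library is established, public, and shared in advance, so it is never stored or transmitted alongside the metadata.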

The resulting compressed data only ever consists of metadata to describe what the data was when it did exist.

The public sequence library would contain multiple sequence lengths, so you would check your data size, compare it to the library length that best fits, and as you iterate and compress you step back to the next smallest size until you run out of practical compression.
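Picking that best-fit length could be as simple as something like this (again, the lengths here are made up purely for illustration):

    # Hypothetical set of lengths the public library offers, largest first.
    LIBRARY_LENGTHS = [4096, 512, 64, 8]

    def best_fit_length(data_len: int) -> int:
        """Pick the largest library length the data can still fill."""
        for length in LIBRARY_LENGTHS:
            if length <= data_len:
                return length
        return LIBRARY_LENGTHS[-1]

    best_fit_length(100_000)   # 4096
    best_fit_length(300)       # 64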

So, depending upon the chunk size for the sequences, a multi-gigabyte file could in principle be reduced to a few kilobytes of metadata at most, possibly even just bytes.

Imagine being able to download an entire UHD film in mere seconds over the crappiest of narrow-bandwidth connections, or saving a decade's worth of images on a tiny drive.

That would be awesome.




Concept (c) 2019 letsbuildathing.com

3 comments:

  1. Well you came across the Jan Sloot story...

  2. Thanks... I was completely unaware, and have been Googling since your comment - fascinating stuff! I expect the process is very similar, the point being that you don't need to record the data itself. Now I feel more compelled to complete the code I've been working on to make this real. There will be an update to this article soon too because I've been hitting some issues with it... so reading about Jan Sloot makes me think if the technique is essentially the same, or at least very similar, he would have been hitting the exact same issue I'm having with this now.

  3. If you like we can have a chat about that: https://us23.chatzy.com/42347702449070
