# Thread: lossless data compression method for all digital data types

1. Originally Posted by pacalovasjurijus
> WhiteHall paq 1.0.0.1.3 version 1.0.0.1.3
> Before 100 000 000 Bytes Enwik8
> After 49 Bytes Enwik8.b
> Time 4 hours 10 minutes.

Are you able to decompress it back to byte-by-byte identical enwik8?

2. I think I need to try algorithm ilin and prng.

3. Originally Posted by pacalovasjurijus
> I think I need to try algorithm ilin and prng.

So is that a "No" to the question about a working decompressor?

4. I have taken the HEX exactly the same, as I am halfway through (maths theorem). I have followed a lossless compression algorithm; it can compress again and again, by more than 40%.

5. Originally Posted by uhost
> I have taken HEX exactly the same as I am half way through(maths theorem). i have followed lossless compression algorithm it can compress again and again with greater than 40%

Still no answer about a working decompressor.
With no decompressor, your results are pathetic. I can compress anything to 1 bit if decompression is not needed.

Put up or shut up
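
A claimed ratio only counts once the output decompresses to the exact input. Here is a minimal sketch of that check. It uses Python's standard `zlib` as a stand-in codec (it is not the WhiteHall program quoted above), and `roundtrip_ok` and the sample text are made up for illustration:

```python
import zlib

def roundtrip_ok(data: bytes) -> bool:
    """Return True only if decompress(compress(data)) is byte-identical."""
    packed = zlib.compress(data, 9)
    return zlib.decompress(packed) == data

# Placeholder input, NOT the real enwik8 file.
sample = b"enwik8 stand-in text " * 1000
packed = zlib.compress(sample, 9)
print(len(sample), "->", len(packed), "bytes, lossless:", roundtrip_ok(sample))
```

Any compressor making a claim like the one above should be able to pass the same kind of test on the actual file.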

6. My first method failed, as proved by Schaader on this page. Now I'm on a second attempt (trying version 2).

Question:
1 day = 86,400 seconds. Can I consider it as compressed seconds if I call it "1 day" instead of counting the "86,400" seconds that passed?

It's the same size, like 1000 kB = 1 MB or 1000 g = 1 kg, but the string is shorter. Is it a compressed string?

7. > Its the same size like 1000 kb = 1mb, 1000gr = 1kg. But the string is shorter. Is it compressed string?

Yes, in some cases (the original string is purely ASCII decimal, with no entropy coding); it's not so simple otherwise.
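
To illustrate the "purely ASCII decimal" case: the text `86400` takes 5 bytes as ASCII digits, while the same number stored as a binary integer fits in 3 bytes. A small sketch, where the helper names `pack_decimal` and `unpack_decimal` are made up for illustration:

```python
def pack_decimal(text: str) -> bytes:
    """Store an ASCII decimal string as a big-endian binary integer."""
    value = int(text)
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, "big")

def unpack_decimal(blob: bytes) -> str:
    """Turn the binary integer back into its decimal text."""
    return str(int.from_bytes(blob, "big"))

text = "86400"
blob = pack_decimal(text)
print(len(text), "ASCII bytes ->", len(blob), "binary bytes")
assert unpack_decimal(blob) == text
```

The saving here comes from the redundancy of ASCII digits (only 10 of 256 byte values are used), not from the renaming itself. It also only round-trips for numbers written without leading zeros.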

8. ## Thanks:

rarkyan (10th October 2020)

9. Thanks. As for sequences or patterns, they may be infinite. I'm still observing the patterns in random data. But for the basics, I'm trying to create a mechanism that works like a clock, with seconds, minutes, hours, days, months, years, etc.

A sequence of seconds turns into a minute when it hits 60; it is converted into 1 minute.

As for the compression, I'm trying to create a short ID to simplify the sequence, or a notation for a certain sequence. I'm still observing this. If someone has a good reference, it may help me learn.

10. Originally Posted by rarkyan
> Thanks. As for sequences or patterns, they may be infinite. I'm still observing the patterns in random data. But for the basics, I'm trying to create a mechanism that works like a clock, with seconds, minutes, hours, days, months, years, etc.
>
> A sequence of seconds turns into a minute when it hits 60; it is converted into 1 minute.
>
> As for the compression, I'm trying to create a short ID to simplify the sequence, or a notation for a certain sequence. I'm still observing this. If someone has a good reference, it may help me learn.

I actually tried this 17 years ago in 2003, because I was trying to see if storing data in the differential delta of time was worthwhile or not. I used a cesium-133 atomic clock paired with a quartz crystal from a PC to time, respond, send, and receive. The quartz crystal was accurate to 18.2 clock ticks per second.

This meant 18.2 * 60 = 1092 states per second, a little over four full cycles of the 256 ASCII values (256 * 4 = 1024). But it wasn't good for data compression or even transmission, and I'll explain why.

The pigeonhole principle that applies to data compression applies to data transfer and communication as well.

If you were sending 1 byte then you'd have 1092 states per second - 256 ASCII literal states = 836 extended states to work with. But you'd need to account for 65,536 - 256 literal = 65,280 extra states the moment you added more than 1 byte. And beyond 2 bytes it gets even heavier. With 3 bytes you'd need to handle 16,777,216 states per second to send or receive 3 bytes in time. Yes, that is only 3 bytes per second.
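
The growth described above can be tabulated. This sketch is one reading of the argument, not a model of the original experiment: it takes the 1092 states per second figure from the post as given and asks how many seconds one n-byte symbol would need if each of its 256^n possible values must get its own time slot. The name `seconds_per_symbol` is hypothetical.

```python
STATES_PER_SECOND = 1092  # figure quoted in the post, taken as given

def seconds_per_symbol(n_bytes: int) -> float:
    """Seconds needed to give each of the 256**n values its own slot."""
    return 256 ** n_bytes / STATES_PER_SECOND

for n in (1, 2, 3):
    states = 256 ** n
    print(n, "byte(s):", states, "states,",
          round(seconds_per_symbol(n), 2), "s per symbol")
```

One byte fits inside a single second, but each extra byte multiplies the time by 256.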

Now consider the fact that since the 1960s and 1970s we've been able to use landline telephone modems at 150 baud or more. That's 150 bits per SECOND = 18.75 BYTES PER SECOND. Then during the 1980s came 300, 600, 1200, 2400, 4800, 9600, 14.4k, and 19.2k modems, then 28.8k, then 56k modems, and then the broadband generations, to where several MEGABYTES PER SECOND can now be transferred and stored.

The best that a single medium-free method is able to do with accurately measurable universal time is between 1.8 and 2 bytes per second, and that's even with tricks such as assigning the extra slots to a sliding window of previously seen data. You might get 3 bytes per second out of it if lucky. But the more you try to send at once, the less you get out of the medium.

Instead, they got data transfer to where it is today by sending as many bits as possible as 1/0 on/off states per second (with error correction). They then realized that they could do this nearer to the speed of light with fiber-optic channels working in tandem, and later that they could still approach the speed of electricity and sound waves but use the equivalent channels or more over electromagnetic frequencies, which your Wi-Fi, cellular, and other transmission systems use today to achieve very high speeds.

Although the method does work for transmission... it is a little over 2 BYTES per second and that's it. If you went to 1 byte per second, you get 836 states you can work with. If you reduced that to a nibble per second you would have 1076 states to work with rather than 836, even though the literal becomes half a byte. If you were to send 1 bit per second as a literal, you would have 1090 additional states to work with per 18.2 cycles per second.
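
The spare-state figures can be recomputed in a couple of lines. The sketch assumes the post's 1092 states per second and counts a literal of k bits as 2^k reserved states (so a single bit reserves 2 states); `spare_states` is a made-up name.

```python
STATES_PER_SECOND = 1092  # figure from the post, taken as given

def spare_states(literal_bits: int) -> int:
    """States left over after reserving one state per literal value."""
    return STATES_PER_SECOND - 2 ** literal_bits

for name, bits in (("bit", 1), ("nibble", 4), ("byte", 8)):
    print(name, spare_states(bits))
```

Under that counting convention the results are 1090 for a bit, 1076 for a nibble, and 836 for a byte.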

So while you get more out of it going smaller and multiplexing the channel medium, you merely stay the same as you go larger (or it takes exponentially MORE time to save, transmit, or store data as you go from 256^1 to 256^2, then again at 256^3, then 256^4, etc).

Even if you were to multiplex the 2 bytes per second it can transfer, the transmitter and receiver have to have perfect synchronization first AND a way of signalling a stop bit or end of transmission, and a 150 to 300 baud modem from the 1970s would still have it beat at sending or receiving data.

Although there are 86,400 seconds per day and 31,536,000 seconds per year, whether you assign 1y or 1d to these it wouldn't matter because for compression to occur it would have to be EXACT to the constants...and the problem is that it isn't, because each of the other states whether you're dealing with 2 bytes and 65,536 different states or 4 bytes and 4,294,967,296 different states needs to be represented and accounted for.

This means that even if you said the time required to send or receive data is 4 days, 3 hours, 56 minutes, 23 seconds, and 4 milliseconds... what it takes to represent that exactly is the same size or larger. :-/
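
This can be checked directly: splitting a duration into days, hours, minutes, seconds, and milliseconds is just a mixed-radix rewrite of one integer, so both forms carry the same information. A sketch with hypothetical helper names `to_fields` and `from_fields`:

```python
# hours per day, minutes per hour, seconds per minute, ms per second
RADICES = (24, 60, 60, 1000)

def to_fields(total_ms: int) -> tuple:
    """Split milliseconds into (days, hours, minutes, seconds, ms)."""
    fields = []
    for radix in reversed(RADICES):
        total_ms, part = divmod(total_ms, radix)
        fields.append(part)
    fields.append(total_ms)  # whatever is left is whole days
    return tuple(reversed(fields))

def from_fields(fields: tuple) -> int:
    """Rebuild the millisecond count from (days, hours, minutes, seconds, ms)."""
    total = fields[0]
    for radix, part in zip(RADICES, fields[1:]):
        total = total * radix + part
    return total

example = (4, 3, 56, 23, 4)  # the duration used in the post
total_ms = from_fields(example)
assert to_fields(total_ms) == example
print(total_ms, "ms, which needs", total_ms.bit_length(), "bits as a plain integer")
```

Because the conversion is reversible in both directions, naming the units cannot make the value cheaper to store; every distinct duration still needs its own distinct code.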

No free lunch...yet!

Maybe there is still a way, but unfortunately it doesn't work like this. A constant can be used as a character representation to compress a KNOWN constant, like Pi, or the C or F for Celsius or Fahrenheit in a formula. But when you try to do that with random data, the result is 1:1 or larger than what we try to represent as a placeholder between those known constants.

11. Apologies for any typos and letters; trying to type this from an android phone rather than a PC keyboard.

12. ## Thanks:

rarkyan (12th October 2020)

13. Thank you very much for the explanation and the great practical experience. However, I don’t fully understand the part about data transfer and transmission.

I agree with this :

“Although there are 86,400 seconds per day and 31,536,000 seconds per year, whether you assign 1y or 1d to these it wouldn't matter because for compression to occur it would have to be EXACT to the constants...and the problem is that it isn't, because each of the other states whether you're dealing with 2 bytes and 65,536 different states or 4 bytes and 4,294,967,296 different states needs to be represented and accounted for.

This means that even if you said the time required to send or receive data is 4 days, 3 hours, 56 minutes, 23 seconds, and 4 milliseconds... what it takes to represent that exactly is the same size or larger. :-/”

And maybe this :

“So while you get more out of it going smaller and multiplexing the channel medium, you merely stay the same as you go larger (or it takes exponentially MORE time to save, transmit, or store data as you go from 256^1 to 256^2, then again at 256^3, then 256^4, etc).”
----------------------------------------
I see something unique in the patterns of random data. I need to give each pattern a name, as short as possible, that stands for the full pattern structure. But the challenge is that I must maintain the ID stock. Since each ID is unique, I can't use it twice for different patterns.

About the pigeonhole principle:

Pigeons in holes. Here there are n = 10 pigeons in m = 9 holes. Since 10 is greater than 9, the pigeonhole principle says that at least one hole has more than one pigeon. (The top left hole has 2 pigeons.)
(https://en.wikipedia.org/wiki/Pigeonhole_principle)
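
Applied to compression, the pigeons are all possible inputs of a given length and the holes are all possible shorter outputs. A quick count (function names are illustrative only) shows there is always one hole too few:

```python
def strings_of_length(n_bits: int) -> int:
    """Number of distinct bit strings of exactly n_bits bits."""
    return 2 ** n_bits

def strings_shorter_than(n_bits: int) -> int:
    """Number of distinct bit strings of length 0 .. n_bits - 1."""
    return sum(2 ** k for k in range(n_bits))

n = 16
pigeons = strings_of_length(n)
holes = strings_shorter_than(n)
print(pigeons, "inputs,", holes, "shorter outputs")
assert holes == pigeons - 1
```

So no lossless scheme can shrink every input of length n; at least two inputs would have to share one output, and the decompressor could not tell them apart.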

I don’t know whether the pigeonhole principle applies to my method or not, but this is what I want to achieve:
· In the first method I used the 2^n formula to generate IDs. But this method didn’t work, since it generates every possible pattern and the ID seems to need more digits than the specific pattern itself.

· In the second method, I try to use the combination formula (nCr) to minimize the ID stock. But I need to make sure the IDs really are much shorter (as strings) than the bytes of the pattern itself. For example:

Here is a “test.bmp” file (size is 384 bytes):
*) my apologies, I always try on small files

Open in hex editor :

From the hex list, 64 distinct hex codes (00 not included) are used to build the file. Sorted in ascending order:
02, 03, 07, 0e, 0f, 18, 1b, 1c, 1e, 20, 23, 24, 27, 28, 38, 3b, 3e, 3f, 40, 42, 47, 49, 4d, 4f, 60, 61, 62, 6c, 6d, 6f, 70, 71, 76, 79, 7c, 7e, 7f, 80, 81, 8c, 8d, 8f, 90, 93, 9c, 9e, c0, c3, c7, d8, db, dc, e0, e1, e4, e7, ec, ed, ee, f0, f2, fc, fe, ff

Now I search for hex “24”:

That’s the pattern. Now focus on one column (07) as the sample:

Hex “24” in column 07. Out of 24 rows in total, it occurs at coordinates 8, 9, 23 (3 occurrences).
By the combination formula, this is nCr:
n = 24 (rows)
r = 3 (hex code count)
24C3 = 2,024 patterns (and of course 2,024 different IDs)

Now for the IDs, I need to map all of the 2,024 patterns. In data compression, a short ID matters because it reduces the size. Let's say I create a 2-digit ID using all available 1-byte characters. There are 256 of them (1 byte per hex code), so I have 256^2 = 65,536 IDs.

I can't create the IDs myself (I need a programmer to do this), but let's assume that the ID for hex “24” is “#a”.

Then I just write #a in the output table, which, let's say, contains only text and strings. To restore the pattern, the computer runs through 24C3 until it finds #a and puts the pattern back into hex editor form.
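
The ID mapping described here does not need a search. One standard way to number the 24C3 patterns is the combinatorial number system, sketched below. This is a swapped-in technique rather than the method from the post, it assumes the coordinates 8, 9, 23 are 0-based row indices in 0..23, and `rank` / `unrank` are illustrative names.

```python
from math import comb

def rank(positions) -> int:
    """Index of a position set in the combinatorial number system."""
    return sum(comb(p, i) for i, p in enumerate(sorted(positions), start=1))

def unrank(index: int, r: int) -> list:
    """Inverse of rank(): recover the r positions from the index."""
    positions = []
    for i in range(r, 0, -1):
        p = i - 1
        while comb(p + 1, i) <= index:
            p += 1
        index -= comb(p, i)
        positions.append(p)
    return sorted(positions)

pattern_id = rank([8, 9, 23])
print("ID:", pattern_id, "of", comb(24, 3), "possible patterns")
print("bits needed:", (comb(24, 3) - 1).bit_length())
assert unrank(pattern_id, 3) == [8, 9, 23]
```

An ID in the range 0..2023 needs 11 bits, so a 2-byte ID (16 bits) is already larger than necessary. The ID also only says where one hex code sits in one column; the output still has to store which hex code it is, how many times it occurs, and the same information for every other code and column.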

For columns 08 and 0f, the process is the same.

This is just an example. The output table will not compress the test file, because the output table file size is determined by:
(n hex row * m hex digit) + (r digit id * 16 column * n hex row)
The maximum number of hex codes is 255 (00 not counted).

In this example it should be :
hex row = 64
hex digit = 2
digit id = 2
column = 16

(64 * 2) + (2 * 16 * 64) = 128 + 2048
= 2176 bytes

*) size tested in a plain text document (Notepad).
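
The size formula can be written as a small function so it is easy to try other values. The parameter names simply mirror the terms in the formula, and `table_size` is a made-up name:

```python
def table_size(hex_rows: int, hex_digits: int, id_digits: int, columns: int) -> int:
    """Output-table size in bytes, per the formula in the post."""
    return hex_rows * hex_digits + id_digits * columns * hex_rows

ORIGINAL_SIZE = 384  # size of test.bmp in bytes, from the post

size = table_size(hex_rows=64, hex_digits=2, id_digits=2, columns=16)
print(size, "bytes for the table vs", ORIGINAL_SIZE, "bytes for the original")
```

Note that the formula does not depend on the file size at all, only on the number of distinct hex codes, the column count, and the ID length.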

I don’t know where the error is. If there is a mistake, please let me know. Thanks for the help.

14. Originally Posted by rarkyan
> Thank you very much for the explanation and great practical experience. [...] I don’t know where the error will happen. If there is a mistake, please let me know. Thanks for the help.

You're welcome. Hopefully I understand correctly, but you are saying that with your transformation of the hex data and rows that you're able to create a result that only needs 2,024 states rather than 2,048 to account for all unique possibilities?

If that is the case, then since 2,048 = 2^11, you can shorten 24 of the patterns of an 11-bit sequence that goes from 00000000000 to 11111111111 by 1 bit, and have those 24 patterns be 10 bits rather than 11. You might have to read 3 hex symbols at a time to do that since each hex digit is a nibble (4 bits per symbol). But if you were able to do that, then every time the data landed on one of those 24 patterns (ideally the 24 most frequent ones) it would compress by 1 bit. But this is assuming the transformation somehow accounts for all states.
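
Worked with exact numbers: 2,024 states fit in 11 bits (2^11 = 2,048), which leaves 24 unused codes. A truncated binary code spends those spare codes by giving 24 of the values a 10-bit code while the rest keep 11 bits. A sketch, with an illustrative function name:

```python
def truncated_binary(value: int, n: int) -> str:
    """Encode value in range(n) using k or k+1 bits (truncated binary code)."""
    k = n.bit_length() - 1           # floor(log2(n))
    short_codes = 2 ** (k + 1) - n   # how many values get the short k-bit code
    if value < short_codes:
        return format(value, "b").zfill(k)
    return format(value + short_codes, "b").zfill(k + 1)

N_STATES = 2024
lengths = [len(truncated_binary(v, N_STATES)) for v in range(N_STATES)]
print(lengths.count(10), "values use 10 bits,", lengths.count(11), "use 11 bits")
```

The gain is at most 1 bit, and only for 24 of the 2,024 values, so it helps only if those values are assigned to the patterns that occur most often.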

If it doesn't, then if you're using a lookup table, most of the data will come back correct, but the states that overlap will cause errors when you try to return the data to 1:1 states on decompression. (You could still make that work with a lossy algorithm for graphics and audio with a few adjustments, but not losslessly, because you'd have to avoid data tables that produce errors and decompress incorrectly, and there would be no way to do that.)

If it is different from that, I am not sure how you are translating the hex bytes that overlap unless there is something I'm not seeing like a substitution algorithm that treats non-aligned bytes as search and replace matches similar to when you use a text editor to search and replace data and replace it with a smaller symbol or string?

(The use of time and data constants may not be applicable to your algorithm outside of an absolute for data transmission with time as a medium, so apologies if that caused any confusion. The principle of constants and states would still apply, though. I'm trying to make sure I understand the way you are approaching the data first before I respond back. Thanks )

15. I will wait to see your response but another thing I'm wondering is where the compression is happening because on the example BMP file with 384 bytes, it gets expanded to 2176 bytes as a text file containing hex symbols that are each 4 bits per symbol. If you were to convert that back to 8 bit ASCII, you'd divide the amount of hex symbols by 2 and get 1,088 bytes which is 704 bytes larger than the original uncompressed BMP of 384 bytes?

16. "If it doesn't, then if you're using a lookup table, most of the data will return correct but the states that overlap will cause errors when you try to return data back to 1:1 states with decompression."

I think it will not overlap, because each hex code in the data structure is mapped. From the beginning they do not overlap each other.
---------------------------

"If it is different from that, I am not sure how you are translating the hex bytes that overlap unless there is something I'm not seeing like a substitution algorithm that treats non-aligned bytes as search and replace matches similar to when you use a text editor to search and replace data and replace it with a smaller symbol or string?"

Yes, I'm trying to replace the sequence of a hex pattern with a string that is as short as possible. As for the clock mechanism, let's say the units are seconds, minutes, hours, days, etc. 60 s = 1 min; the seconds always keep ticking, but we don't call the next tick 61 seconds, we call it 1 min 1 s. Something similar to that. First I must set a limit on the string itself, using x digits to replace the pattern. Using 2 digits I can generate 256^2 = 65,536 different names. Using 3 digits I can generate 256^3 = 16,777,216, etc. The problem is that it's hard for me to observe and try this on an actual file. If the number of hex pattern sequences is < 256^n strings (IDs), then I'm pretty sure this is where the compression happens. But since I can't create the program to observe the sample, this explanation may lead to misunderstanding.
---------------------------

"I will wait to see your response but another thing I'm wondering is where the compression is happening because on the example BMP file with 384 bytes, it gets expanded to 2176 bytes as a text file containing hex symbols that are each 4 bits per symbol"

The compression happens when the complete patterns are created and replaced by short strings. In the example, my file gets expanded to 2176 bytes because it didn't meet the output file requirement: the file is too small, and the output file writes too many strings. I need to check this on a large file, but I need a programmer's help.

If anyone wants to volunteer, or wants to help me create the program, I would be very grateful.
Thanks
