Thread: loseless data compression method for all digital data type

1. Originally Posted by Shelwien
But why do you think that its possible to losslessly compress a file without any programming skill, just with simple arithmetics?
I don´t think so. I mean using arithmetics/models in conjunction with programming abilities.
Because computers are designed to works only with numbers (bits) 0 and 1.
Next, we have 256 possible values in one byte.
Instead of encoding the data sequence as 00011101 we can simply write "1D" to save some space. This can be further simplified to converting file to right interpretation - EXE, JPG, CHM, DOC, RAR etc, althought I´m not sure HOW it works, because I need to interpret data as text and binary files in RAW text mode are useless unless we use hexadecimal (or other) converter. But the base is still THE SAME. That´s big advantage for us.

Originally Posted by Shelwien
Let's say, we have a megabyte of zeroes encrypted with AES using some unknown password.
Would you compress it by finding the password, or would you believe that there's a magical formula that can do it some other way?
Well, you have 1 000 000 same characters. This can be simplified to "01000000" (that´s not byte, it´s simplification) where first character is yours and second part represents how many (or whatever you want, it depends on software´s ability to understand what it means). The simplest algorithm RLE will be useless here, because we have to store additional character - 0x1000000.
Further, if we don´t have any prior knowledges of input, reduction would not be that strong.
Second part. We are all able to see repeated patterns at every 256th position (farthest possible pos). And if we can predict it with best probability, then we will win.

Lastly, but not the least, remember that one member here write that "we thought that arithmetic coding is the end of it. Now there is an ANS or Assymetric Numeral Systems"? It explains that it´s all about finding the best way to compress data and even incompressible. That means - for now, random data are incompressible, but what about improvements in 10 yrs? I explained that in detail at rarkyan´s welcome post.

2. Thanks:

rarkyan (22nd August 2019)

3. I like the spirit

Thank you CompressMaster

4. > I don´t think so. I mean using arithmetics/models in conjunction with programming abilities.
> Because computers are designed to works only with numbers (bits) 0 and 1.

Its not so simple.
Current cpus are designed to support up to 64-bit data types.
Yes, we can simulate any data types with bit operations, but in practice there'd
be 100x speed difference between hardware 128/64 division and boolean simulation of it.

Its necessary to access some known redundancy in the data and processing is pretty slow even now,
making it 100x slower would turn it impractical.

Same applies to most other operations. Like, arithmetic coding ideally can be implemented
with precise long numbers, but that would have quadratic complexity, so we'd have to
wait years to encode a megabyte file with it.
So in practice we have imprecise ACs modified for cpu precision - there's some invisible
(under a bit) redundancy, but it can be actually used.

> Lastly, but not the least, remember that one member here write that
> "we thought that arithmetic coding is the end of it.
> Now there is an ANS or Assymetric Numeral Systems"?

ANS doesn't provide better compression than AC/RC,
basically you can treat it as AC speed optimization method.

> It explains that it´s all about finding the best way to compress data and even incompressible.

Sure, for example here I made a coder which can compress some otherwise incompressible files:

It also may be possible to compress some other filetypes via cryptography or recompression

But if you can't compute how many bits you need to encode a 256-byte string where each byte
value occurs only once, it means you can't provide any useful ideas for data compression.

5. Eugene, you have more patience than most.

6. Actually I didn't like JamesB's post here: https://encode.su/threads/1176-losel...ll=1#post61331
Authority-based arguments just... don't work.
So I'm trying a different approach - it'd be nice to have one that works, since this kind of discussion repeats infinitely.

7. Originally Posted by Shelwien
Actually I didn't like JamesB's post here: https://encode.su/threads/1176-losel...ll=1#post61331
Authority-based arguments just... don't work.
So I'm trying a different approach - it'd be nice to have one that works, since this kind of discussion repeats infinitely.
Sure, I get your point here. But I think James' frustration is completely understandable. Perhaps there's some way to break these threads into groups? Theory, practice, etc? This thread is clearly not 'in practice' since it mathematically can not be. There either isn't a working demonstration or the understanding of the test data is wrong.

Maybe an established test set of 'random' data for the site would help?

New member: "I have an algorithm that can compress random data!"
Community: "How does it do on encode.su's 'random test suite'?"
New member: "Um, I haven't tried yet."
Community: "come back when you do. And bring you're decoder with you."

Just thinkingbout loud. This "random data" nonsense was the beginning of the end for the compression groups of the past.

8. Originally Posted by Shelwien
Actually I didn't like JamesB's post here: https://encode.su/threads/1176-losel...ll=1#post61331
Authority-based arguments just... don't work.
So I'm trying a different approach - it'd be nice to have one that works, since this kind of discussion repeats infinitely.
I am sympathetic to your patience, but I'd like to mention that it is really unfair to say that JamesB was speaking from the position of authority.
His point had nothing to do with his position or authority. He simply operates under fundamental assumption that one cannot disprove something one does not understand in the first place.
I think it is fair to treat the refusal of some people in this thread to engage with the counting argument as the admission of lack of even basic understanding of what would be involved in countering it.
And then it is just as fair to not support the discussion that surely will not lead to any advancements in theory or practice of data compression.

This thread is a local equivalent of the offices in various Academies of Sciences that used to deal with the inventors of perpetual motion machines. Some of these offices existed for decades. Zero perpetual motion machines were invented, but not for lack of trying, and definitely not because scientists somehow lacked the imagination to think outside of the box or abused their positions of power.

9. There's kinda a plan to make another subforum for this and put these threads there.
But it would only make sense once there's some other activity in this forum.
People kinda stop posting and even visiting after several days of no activity.

> Maybe an established test set of 'random' data for the site would help?

Why, we have a perfect solution for this - http://prize.hutter1.net/

10. Originally Posted by Shelwien
There's kinda a plan to make another subforum for this and put these threads there.
How about we simply differentiate between publicly available source code and no publicly available source code? I realize that there are some instances where legitimate contributions happen but where the code is currently not public. But I think the vast majority of the 'random compression' noise would be filtered simply by virtue of not having a publicly available demonstration which can be peer reviewed or have its conclusions verified or reproduced.

Originally Posted by Shelwien
But it would only make sense once there's some other activity in this forum.
People kinda stop posting and even visiting after several days of no activity.
This is sadly true. But I think there is a tipping point as well where the insane start to run the asylum. Eventually, even those of us who have been here for over a decade would simply stop checking in and worse yet, those who actually have something to actually show will eventually be drown out by the noise of the crackpots.

We have seen this all play out before. I remind us all of this post on comp.compression by non other than Mark Adler:

If anyone has any questions about things like zlib or pigz, then I suggest that you post on stackoverflow.com. I now check there regularly for questions tagged with things like zlib, deflate, inflate, etc. I won't be visiting here anymore. In general, I recommend stackoverflow and its brethren at stackexchange.com when you have questions. They do not however support general discussions. Also if you want to proclaim the virtues of your recursive, random compression scheme that's almost working, then comp.compression is the place for you.

Goodbye, and thanks for all the fish.

This must not happen to encode.su

11. My point wasn't so much to tell people they're nutters (although in this case I'm starting to wonder - sorry folks!) or to lose patience, but to channel the obvious enthusiasm into something *useful*. People may lack the necessary skills to make any useful contribution right now, but you have to start somewhere - even if it's RLE. Pick something that works and play with it. Then try some variations and play with those. Who knows, from a position of ignorance you may just go in a direction no one else considered due to being blinded by "knowing" the right way. However be a scientist. Test your hypothesis. Code it: encode, decode, compare. Who'd have dreamt BWT was around the corner just before it arrived?

However just forget about the random data stuff. It's pointless on so many fronts, the first being it's impossible and even more importantly no one wants to compress random data anyway!

Concentrate on the non-random. It's so much more useful.

12. Thanks:

michael maniscalco (23rd August 2019)

13. Im agree to move this thread into subforum sir. At this point i dont want to disturb any working known theory and other prove which state that we cant create lossless compression on random data. Also i still unable to prove my method. I did test with anything i know, still have zero result. But im still trying other way over and over again.

I dont want to disturb anyone. Only those who have the same curious or maybe want to solve the lossless random data compression by studying the pattern maybe support the discussion. I wish any kind of non valid idea moved to subforum. Visit anytime to help us improve. Thank you

14. I agree on the thread.
Compression by calculus
https://github.com/pjmidgard/Spring?files=1

15. Originally Posted by Shelwien
okay, let's make it specific:
Code:
`openssl enc -e -aes-128-cbc -nopad -in zeroes -out zeroes.aes`
...
Expected result for me.

AES-cbc mode works just like a PRNG (pseudo random number generator).

16. Standard PRNG are pretty bad:

Still, even for AES there're some options.
1) we can try passwords from a good password list and test if a file decrypted with some password is more compressible
2) we can try cracking it with SAT - with modern hardware its not totally impossible, also we can try and invent some advanced method for SAT solving.

Anyway, my point is that there's a known method to compress zeroes.aes to 32 bytes or so (though its very expensive).
Does anybody think that there's _another_ method to compress it that doesn't require decrypting?

17. Originally Posted by rarkyan
Im agree to move this thread into subforum sir. At this point i dont want to disturb any working known theory and other prove which state that we cant create lossless compression on random data. Also i still unable to prove my method. I did test with anything i know, still have zero result. But im still trying other way over and over again.

I dont want to disturb anyone. Only those who have the same curious or maybe want to solve the lossless random data compression by studying the pattern maybe support the discussion. I wish any kind of non valid idea moved to subforum. Visit anytime to help us improve. Thank you
1 byte of 256 variation not enough, just compress images on 100 bytes, we need too take more but less than two bytes.

18. Originally Posted by pacalovasjurijus
1 byte of 256 variation not enough, just compress images on 100 bytes, we need too take more but less than two bytes.

Unclear. Can you give detailed example?

Hex code represented into 256 codes.
Because 1 byte contain 8 bits, so that 256 generated from 2^8 bits.

If i write the results in sequences, it will consist of 2 digits character start from 00 -- ff, which mean if i write sequences using number :
1,2,3,4,5,....99,100,101,...255,256 ---> the bold counted 3 bytes (3 digits lenght).

But if i write using hex code, sequences on the 3 digits lenght will be shorten into 2 digit (2 bytes). It is reducing size

In my opinion, 2^8 able to store 256 pattern sequences.

1 byte size equal to 1 hex code. So there are 256 known hex codes.

Using the possible combination of 256 hex codes to write a sequences, for example using 2 digit lenght then i have :
256^2=65.536 patterns ID (that if use 2 digits). Which mean, i simply write 00 or 0* or anything combined from 2 digit value for maximum 65536 in number sequences. I save 3 bytes for that lenght.

For 3 digits lenght, 256^3=16.777.216 patterns ID, much saving for the byte lenght.

Maybe im wrong, just correct me and will try to understand. Thank you

19. Thanks:

pacalovasjurijus (29th August 2019)

20. I think no one will compress this month random files they are incompressible. But months have passed, they are shrinking.

21. According to the latest tests, random files of this and last month are compressed.

22. Well some people say: no pict = hoax

Good job on compressing random files. But other always need a proof of working compression.
I dont have a proof yet, just a concept and testing and trying with my very limited knowledge and tools.

Anyway, thank you for joining the random compression. Keep move forward

23. The first working version compresses files random.
https://github.com/pjmidgard/Spring_1.0.0.2?files=1

24. Thanks for actually working on a compressor that can be tested. This can help us to better explain where your error is.

I tested the compressor using files of 2 bytes. They actually get larger even when using short filenames, but I actually want to show you something else: The violation of the pigeonhole principle.

So let's try some 2 byte input files:

Code:
```Input bytes (hex)   Input filename  Output bytes (hex)
00 00                               [Python ERROR]
01 00                               [Python ERROR]
10 00                               [Python ERROR]
40 00               test1.dat       74 65 73 74 31 2E 64 61 74 2F 30 00 22
80 00               test2.dat       74 65 73 74 32 2E 64 61 74 2F 30 00 22
80 00               test1.dat       74 65 73 74 31 2E 64 61 74 2F 30 00 22```
The first three actually show a bug with your code, the python error is: "line 349: "while z<xc:" - name 'z' is not defined", so this part of the code is still buggy.

The last three are violating the pigeonhole principle. If the same filename (test1.dat) is used, output is identical for different inputs - this means the compressor is not lossless. So when we would want to decompress our compressed file, there is no way for the decompressor to tell if it should decompress "40 00" or "80 00". In other words: Information isn't compressed, but it gets lost.

That's what you've been told here again and again: It's fine to have a compressor, but without working on a decompressor you'll easily run into that non-lossless trap where you get smaller files, but information is lost and you are not able to reconstruct the original file again.

So, have fun with fixing these bugs, files test1.dat and test2.dat are attached in a ZIP archive so you can reproduce the problem. Also created an issue on GitHub.

Finally, the best advice I can give you is: Don't work on a compressor without having a working decompressor.

25. Thanks (3):

Gotty (6th September 2019),pacalovasjurijus (6th September 2019),Shelwien (6th September 2019)

26. Errors have been fixed. Files are now different. Thanks for the help.

27. I have question. Maybe someone can answer it.

Mr. Matt Mahoney give me code to generate 2^75000000 which is on his statement, it might be run in a very long time.

"here is your program to count to 2^75000000 in binary. Please run it and wait for it to finish. Then let us know how long it took."
I dont know whether it is possible to get the result or not, i did test it long long time ago. But since it doesnt have pause menu, i cant continue it.

"The algorithm itself would not be difficult to implement and would work for any N, even the mentioned N = 1.000.000"
I once asked on some math forum about computer ability to generate 2^75000000. I forget the link, but as i remembered one of their representative answer that it will run in "infinite" time.
-------------------------------------------
And the question is :
2^75000000
Is it possible?
How long the process if its possible? Infinite or not

-------------------------------------------
Im continuing my experiment.

28. Originally Posted by rarkyan

Mr. Matt Mahoney give me code to generate 2^75000000 which is on his statement, it might be run in a very long time.
As for randomness, random data are incompressible, that´s true. But since randomness does not exists, they´re all compressible, only we need to find correct patterns. We still have 256 bytes even in text document where each character is unique. It depends only on selected interpretation of your data.

Mr. Matt Mahoney tries to persuade you that you have a method that simply does not work at all. I DISAGREE WITH THAT! Well, randomness DOES NOT EXISTS AT ALL, its all about finding better and better way to minimize randomness and improve compression.

As to BARF, it´s a fake software that does not compress anything. It was written as a joke to debunk claims that random compression is impossible, because some people claimed that they´re able to compress random data recursively. Again, it´s not possible to compress random data, because they does not exists at all - we still have some (and lot of, yet randomly occuring) patterns in it. Of course, infinite compression is impossible - some info MUST be stored. But recursive random data compression might be possible some day. It´s all about finding better and better ways how to do something. I believe... maybe I have an overestimated expectations, but never say never...

29. Please, help me with python. I want to add processes to here. I have 8 core computer. Can't find anything from find this website: https://docs.python.org/3.1/library/...rocessing.html What do I need to do which my code? Here is my code of python:

```
block=147456
bnk=1
virationc=16383
kl=block
bnk=pow(virationc,kl)
```

Here is my Whole code:

30. ```

from multiprocessing import Pool,Value

def f(x):

return (x+1)**16383

if __name__ == '__main__':
pool = Pool(processes=4)               # start 4 worker processes
result = pool.apply_async(f, [147456])     # evaluate "f(10)" asynchronously
#print(result.get(timeout=1))           # prints "100" unless your computer is *very* slow
#print(pool.map(f, range(147456)))          # prints "[0, 1, 4,..., 81]"

bnkw=pool.map(f, range(147456))

import binascii
import json

block=147456
blockw=147455
blockw1=16384
virationc=16383
bitc=14
lenf1=0
a=0
qfl=0
h=0
byteb=""
notexist=""
lenf=0
dd=0
numberschangenotexistq = []
qwa=0
z=0
m = []
p=0
asd=""
b=0
szx=""
asf2="0b"
while b<blockw1:
m+=[-1]
b=b+1
k = []
wer=""
numberschangenotexist = []
numbers = []
name = input("What is name of file? ")
namea="file.Spring"
namem=name+"/"
s=""
qwt=""
sda=""
ert=0
aqwer=0
aqwq=0
aqwers=0
qwaw=""

with open(namea, "w") as f4:
f4.write(s)
with open(namea, "a") as f3:
f3.write(namem)
with open(name, "rb") as binary_file:
lenf1=len(data)

if lenf1<900000:
print("This file is too small");
raise SystemExit
s=str(data)
lenf=len(data)
while dd<3000:

a=0
qfl=0
h=0
byteb=""
notexist=""
lenf=0

numberschangenotexistq = []
qwa=0
z=0
m = []
p=0
asd=""
b=0
szx=""
asf2="0b"
while b<blockw1:
m+=[-1]
b=b+1
k = []
wer=""
numberschangenotexist = []
numbers = []
s=""
qwt=""
ert=0
aqwer=0
aqwq=0
aqwers=0
qwaw=""
dd=dd+1
if dd==1:
sda=bin(int(binascii.hexlify(data),16))[2:]
szx=""
lenf=len(sda)
xc=8-lenf%8
z=0
if xc!=0:
if xc!=8:
while z<xc:
szx="0"+szx
z=z+1
sda=szx+sda
lenf=len(sda)
szx=""

for byte in sda:
aqwer=aqwer+1
aqwers=aqwers+1
qwaw=qwaw+byte
if aqwer<=bitc:
qwt=qwt+byte
if aqwer==bitc:
aqwq=int(qwt,2)
qwt=""
a=a+1
h=h+1
av=bin(aqwq)
if a<=block and aqwer==bitc:
aqwer=0
m[aqwq] = aqwq
numbers.append(aqwq)
if a == block:
qwaw=""
p=0
while p<blockw1:
if p!=m[p]:
k.append(p)
p=p+1
lenfg=len(k)

if lenfg>0:
acvb=lenfg-1
notexist=k[acvb]
if notexist<8192:
raise SystemExit
notexist=notexist-8192
szx=bin(notexist)[2:]
lenf=len(szx)
xc=13-lenf
notexist=notexist+8192
z=0
if xc!=0:
while z<xc:
szx="0"+szx
z=z+1
wer=wer+szx
lenf=len(szx)
szx=""
if lenfg==0:
raise SystemExit
b=-1
kl=blockw
cb=0
er=-1
ghj=0
ghjd=1
bnk=1
p=0
cvz=0

for p in range(blockw):
if lenfg>0:
if virationc!=numbers[p]:
byteb=numbers[p]
numberschangenotexist.append(byteb)
if virationc==numbers[p]:
numberschangenotexist.append(notexist)
ghj=numberschangenotexist[p]
qfl=qfl+1
ghjd=ghj
bnk=1
bnkd=1
kl=kl-1
qwa=qwa+1
if lenfg>0:

bnk=bnkw[kl]
ghjd=0
ghjd=ghj*bnk
cvz=cvz+ghjd
szx=bin(cvz)[2:]
cvz=0
lenf=len(szx)
if lenfg>0:
xc=2064370-lenf
z=0
if xc!=0:
while z<xc:
szx="0"+szx
z=z+1
wer=wer+szx
lenf=len(szx)
szx=""
a=0
numberschangenotexist = []
del k[:]
del numbers[:]
m = []
b=0
while b<blockw1:
m+=[-1]
b=b+1
b=0

a=0
wer=wer+qwaw
qwaw=""
wer="1"+wer+"1"
if dd==3000:
lenf=len(wer)
xc=8-lenf%8
z=0
if xc!=0:
if xc!=8:
while z<xc:
szx="0"+szx
z=z+1
wer=wer+szx
lenf=len(szx)
szx=""
wer="0b"+wer
n = int(wer, 2)
jl=binascii.unhexlify('%x' % n)
sda=wer
with open(namea, "ab") as f2ww:
f2ww.write(jl)

```

31. from multiprocessing import Pool,Value

def f(x):

return (x+1)**16383

if __name__ == '__main__':
pool = Pool(processes=4) # start 4 worker processes
result = pool.apply_async(f, [147456]) # evaluate "f(10)" asynchronously
#print(result.get(timeout=1)) # prints "100" unless your computer is *very* slow
#print(pool.map(f, range(147456))) # prints "[0, 1, 4,..., 81]"

bnkw=pool.map(f, range(147456))

import binascii
I dont know how to read this. Maybe i can ask my friend to"translate" it. I'll see what i can do to help

32. I have 4 core I have made faster 8 times but we can 10 times faster. Now speed 80 bytes per second. Please, Help me with the process.

33. Spring_1.0.0.2

34. Spring 1.0.0.3
Speed 40 KB/s

Page 7 of 9 First ... 56789 Last

Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•