a10.jpg 842468 -> 658121 in 6.4 sec. Stuffit and paq8px beat that at the moment. We did additional tests on larger data sets and the results are about 1% better than paq8px_v67 and 1% behind Stuffit, but Stuffit is faster. I'm going to be looking at some improvements.
We don't actually compress DNA like E. coli. It is actually for SRF files, which contain "reads" from DNA sequencing machines. The machines take millions of DNA strands and go through a small number of cycles (35 in the test files I used, but now a few hundred), sequencing one nucleotide at a time by attaching variants of C, G, A, or T that fluoresce in different colors. The output of the machine goes to an algorithm that pieces the reads together. So each read contains an array of elements, each with 4 numbers, plus the interpreted DNA code and a confidence level. Part of the data is poorly compressed with a variant of zip, so we were able to improve on that by uncompressing it and applying a custom algorithm. It beats paq8px and is reasonably fast. The output files are huge (several GB).
Thanks for the information.
In fact, I'm kind of working on a JPEG recompressor now, among other things,
so a target to aim for could be useful.
Btw, what about multiscan (progressive, etc.), arithmetic coding, and colorspace
support in your JPEG coder?
It's actually based on PAQ, with some tuning and a reduced number of models (12) to make it faster. So it doesn't support progressive mode like Stuffit does. However, progressive mode only makes up a few percent of JPEGs, so adding it didn't have high priority. We did have a customer with 12-bit images, and it wasn't hard to add that capability. I doubt that we will do the full JPEG standard. Nobody supports arithmetic coding or hierarchical mode AFAIK.
here's a little update for you guys
ok, first, my program is not yet complete, as it's my first attempt at playing with bytes... i've been able to copy a file byte by byte (noob, i know)
2) the theory is finished; i only have to finish the program. some parts are already done, and i need to test one last thing...
3) my results: in theory, i should be able to compress ANY file by about 50%
this includes already compressed files
i think i can do even better, if that thing i want to test works as i wish
i'll keep you updated if you promise me you'll test it
(i don't think i needed to ask this, i know you won't resist)
i hope it will be finished friday...
Theorem:
No program can compress without loss *all* files of size >= N bits, for
any given integer N >= 0.
Proof:
Assume that the program can compress without loss all files of size >= N
bits. Compress with this program all the 2^N files which have exactly N
bits. All compressed files have at most N-1 bits, so there are at most
(2^N)-1 different compressed files [2^(N-1) files of size N-1, 2^(N-2) of
size N-2, and so on, down to 1 file of size 0]. So at least two different
input files must compress to the same output file. Hence the compression
program cannot be lossless.
The proof is called the "counting argument". It uses the so-called pigeon-hole
principle: you can't put 16 pigeons into 15 holes without using one of the
holes twice.
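The counts in the proof can be checked mechanically. A minimal sketch in Java (illustrative only, not from the FAQ), for N = 3:

```java
// There are 2^N files of exactly N bits, but only 2^N - 1 possible
// outputs shorter than N bits: 2^(N-1) + ... + 2^1 + 2^0 = 2^N - 1.
public class CountingCheck {
    public static void main(String[] args) {
        int n = 3;
        int inputs = 1 << n;          // 2^3 = 8 files of exactly 3 bits
        int outputs = 0;
        for (int len = 0; len < n; len++)
            outputs += 1 << len;      // 2^0 + 2^1 + 2^2 = 7 shorter strings
        System.out.println(inputs + " inputs vs " + outputs + " shorter outputs");
        // 8 > 7: by the pigeonhole principle, two inputs must collide.
    }
}
```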
Much stronger results about the number of incompressible files can be obtained,
but the proofs are a little more complex. (The MINC page http://www.pacminc.com
uses one file of strictly negative size to obtain 2^N instead of (2^N)-1
distinct files of size <= N-1 .)
Note that
no assumption is made about the compression algorithm. The proof applies to
*any* algorithm, including those using an external dictionary, or repeated
application of another algorithm, or combination of different algorithms, or
representation of the data as formulas, etc... All schemes are subject to the
counting argument. There is no need to use information theory to provide a
proof, just very basic mathematics. [People interested in more elaborate proofs
can consult http://wwwvms.utexas.edu/~cbloom/news/nomagic.html ]
In short, the counting argument says that if a lossless compression program
compresses some files, it must expand others, *regardless* of the compression
method, because otherwise there are simply not enough bits to enumerate all
possible output files. Despite the extreme simplicity of this theorem and its
proof, some people still fail to grasp it and waste a lot of time trying to
find a counter-example.
This assumes of course that the information available to the decompressor is
only the bit sequence of the compressed data. If external information such as a
file name, a number of iterations, or a bit length is necessary to decompress
the data, the bits necessary to provide the extra information must be included
in the bit count of the compressed data. Otherwise, it would be sufficient to
consider any input data as a number, use this as the file name, iteration count
or bit length, and pretend that the compressed size is zero. For an example of
storing information in the file name, see the program lmfjyh in the 1993
International Obfuscated C Code Contest, available on all comp.sources.misc
archives (Volume 39, Issue 104).
A common flaw in the algorithms claimed to compress all files is to assume that
arbitrary bit strings can be sent to the decompressor without actually
transmitting their bit length. If the decompressor needs such bit lengths
to decode the data (when the bit strings do not form a prefix code), the
number of bits needed to encode those lengths must be taken into account
in the total size of the compressed data.
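The "missing bit length" flaw is easy to demonstrate with a code set that is not prefix-free; this toy example (mine, not from the FAQ) counts the distinct ways a bit string can be parsed:

```java
import java.util.List;

// The codes {0, 01, 1} are not a prefix code, so "01" decodes two ways:
// [0, 1] and [01]. Without extra length information, the decoder cannot
// tell which sequence the encoder meant.
public class PrefixAmbiguity {
    static int countParses(String s, List<String> codes) {
        if (s.isEmpty()) return 1;    // one way to parse the empty string
        int n = 0;
        for (String c : codes)
            if (s.startsWith(c))
                n += countParses(s.substring(c.length()), codes);
        return n;
    }
    public static void main(String[] args) {
        List<String> codes = List.of("0", "01", "1");
        System.out.println(countParses("01", codes)); // prints 2
    }
}
```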
Another common (but still incorrect) argument is to assume that for any file,
some still to be discovered algorithm might find a seed for a pseudo-random
number generator which would actually generate the whole sequence of bytes
contained in the file. However this idea still fails to take into account the
counting argument. For example, if the seed is limited to 64 bits, this
algorithm can generate at most 2^64 different files, and thus is unable to
compress *all* files longer than 8 bytes. For more details about this
"magic function theory", see http://www.dogma.net/markn/FAQ.html#Q19
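The 64-bit-seed limit above is pure counting; a quick sketch (my own arithmetic, not from the FAQ):

```java
import java.math.BigInteger;

// A 64-bit seed can name at most 2^64 distinct outputs, but there are
// 2^72 distinct 9-byte files. Most 9-byte files therefore have no seed.
public class SeedCount {
    public static void main(String[] args) {
        BigInteger seeds  = BigInteger.valueOf(2).pow(64);
        BigInteger files9 = BigInteger.valueOf(2).pow(72);
        // On average, 256 different 9-byte files would have to share each seed.
        System.out.println(files9.divide(seeds)); // prints 256
    }
}
```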
Yet another popular idea is to split the input bit stream into a sequence of
large numbers, and factorize those numbers. Unfortunately, the number of bits
required to encode the factors and their exponents is on average not smaller
than the number of bits of the original bit stream, so this scheme too cannot
compress all data. Another idea also related to primes is to encode each
number as an index into a table of primes and an offset relative to the indexed
prime; this idea doesn't work either because the number of bits required to
encode the index, the offset and the separation between index and offset
is on average not smaller than the number of bits of the original bit stream.
each byte (8 bits) can represent a number between 0 and 255
now write these numbers one after the other as 3-digit groups (0 = 000 and 10 = 010), so the result is a very large number
i'm using java, so here is how i did it
i use Math.pow on a large double, then use Math.pow again on
(double - result of previous Math.pow)
and the result is about 25-30 numbers
these will be the output file.
now to decompress, lets say the output contains 3 numbers --> a, b and c
2^a + 2^b + 2^c = the original number
i split the number into parts of 3 digits, and each part will represent a byte of the original file.
in java, the max size of a double is 300 digits
each byte of the file to compress will be converted into 3 digits
300/3 = 100 bytes of the original file
100 bytes will be compressed in 25 - 30 bytes
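For reference, the step described above (repeatedly taking the largest power of 2 that fits and subtracting it) can be sketched with BigInteger instead of double, so the arithmetic is exact; this is my own sketch, not the poster's code:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Repeatedly subtract the largest power of 2 that fits. This produces
// one exponent per set bit of the number, i.e. it just lists the
// positions of the 1 bits.
public class PowersOfTwo {
    static List<Integer> exponents(BigInteger n) {
        List<Integer> exps = new ArrayList<>();
        while (n.signum() > 0) {
            int e = n.bitLength() - 1;                 // largest e with 2^e <= n
            exps.add(e);
            n = n.subtract(BigInteger.ONE.shiftLeft(e));
        }
        return exps;
    }
    public static void main(String[] args) {
        BigInteger n = BigInteger.valueOf(123456789);
        System.out.println(exponents(n).size());       // prints 16: one per set bit
    }
}
```

Summing 2^e over the list reconstructs the number exactly, so the list carries exactly as much information as the set bits of the number itself.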
Indeed, this compresses very well. Actually to ~log(size), so much better than 50%. Too bad that decompression is somewhat worse.
How can you tell whether 010 started as 000,010 or 001, 001 or 010,000?
Actually it's even worse with bigger numbers and more of them.
Read again Bulat's first proof and don't believe in magic.
ADDED:
Bulat:
The MINC page is empty.
ADDED2:
NVM, I'm sleepy and I didn't even see your decompression code. But Bulat explained this, I guess.
i've copied part of the comp.compression FAQ
1. a double has only ~15-16 significant digits. it can represent numbers from 1e-308 to 1e308, but only ~15-16 digits are preserved. try to compute 1e300+1-1e300
2. but that's not important. let's work with large binary numbers, even if they aren't directly supported in Java. each byte is 8 bits, so 32 bytes are 256 bits. instead of writing these 256 bits directly, you want to write the positions of the set bits, i.e. for 010011... you would output (counting from 0) 1, 4, 5, .... on average, half of the bits are set, so instead of writing 256 bits you will write 128 numbers, each 8 bits long, i.e. 1024 bits
i wonder why you think that a 300-digit decimal number, i.e. roughly a 1000-bit binary number, will be represented by 25-30 set-bit positions. it should be ~500
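Both points are easy to verify; a quick sketch (the seed and sizes are my own choices):

```java
import java.math.BigInteger;
import java.util.Random;

public class PrecisionAndBitCount {
    public static void main(String[] args) {
        // 1. doubles keep only ~15-16 significant decimal digits,
        //    so adding 1 to 1e300 is lost entirely to rounding:
        System.out.println(1e300 + 1 - 1e300); // prints 0.0

        // 2. a random ~1000-bit number has roughly half its bits set,
        //    so the "list of exponents" needs ~500 entries, not 25-30:
        BigInteger n = new BigInteger(1000, new Random(1));
        System.out.println(n.bitCount());
    }
}
```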
Thanks Bulat! I just realized that even if the max size of a double is 3,xxx *10^308, that number is actually rounded.
i did some research and found the BigInteger class. I do not know its maximum number of digits, but i managed to make one as big as 28000 digits.
but that is in fact only 30000/3 = 10000, i.e. ~10 KB of the original file
i think 100mb parts would give extremely good ratios without eating the ram too much and without needing too much processing power
that would result in a 300 000 000 digits number
Lone_Wolf, I think it would be best for you to actually try it and see that it doesn't work even when YOU do it
once again, the magic of big numbers....
http://encode.dreamhosters.com/showthread.php?t=484
"i should be able to compress ANY file by about 50% ... this includes already compressed files"
OK, so let's take a file and compress it to 50%.
Now this is a compressed file.
Can you compress it again ?
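The "compress it again" question is where the claim collapses; a quick sketch of the arithmetic (my own example):

```java
// If EVERY file shrank to 50% per pass, iterating would take a
// 1 MB file down to a single byte in about 20 passes -- and that
// 1-byte result would have to decompress back to the original,
// which the counting argument rules out.
public class HalvingAbsurdity {
    public static void main(String[] args) {
        long size = 1_000_000; // bytes
        int passes = 0;
        while (size > 1) {
            size = (size + 1) / 2; // shrink to 50%, rounding up
            passes++;
        }
        System.out.println(passes + " passes to reach 1 byte"); // prints 20
    }
}
```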