http://bzip.org
My quick compile attached to this message!![]()
http://bzip.org
My quick compile attached to this message!![]()
Hello everyone,
thanks for sharing this
Best regards!
I have not looked at BZIP2 or at least not for several years. But does any one have an idea how hard it would be to replace the BWT with a BWTS. Of course you could leave the headers and interface the same and have a phony index. But why waste the space with the index. It might be fun to just try it and I may do it. But since many have toyed with this do you think it would be hard to modify?
It would break compatibility with existing .bz2 files and not provide any advantage in speed or compression ratio besides saving 20 bits per 1 MB block for the starting index. People use bzip2 even though there are better compressors because they care about others being able to decompress the files later. Still, you should try it because maybe I am wrong and there are practical benefits. Breaking up a block into Lyndon words would lose context in stationary sources but might help compression when fast adaptation is needed, but then so would splitting a file into smaller blocks with different data types.
During my working career I was an expert at adding features yet keeping things compatible. It might be child's play to at least allow things compressed with older version to be compatible with a BWTS version. It depends on what the guy who wrote the software did for what he is claiming as the BWT every one does it different and I don't mean just the index. If he used the original BWT where a rotation in the input data only changes the index its very easy. If he at least used a version of BWT where if the input data rotated such that the BWT causes a zero value for index and the input at this point would be a single lyndon word at which point many of the different BWT methods give the same transformed value as the BWTS. If this is the case then you could use in decompression the old BWT routines if index not zero. If index zero you just use the BWTS routines. To keeps it upward compatible on decompression you make the index zero for the BWTSed part.
This means old BZIP2 could compress and decompress files. While a new BWTS could at least decompress any old BWT BZIP2 files. In a slow mode you could compress blocks twice once with BWT and once with BWTS on decompression the fact that index not zero in a block by block bais would single the old style was used. This would mean any program compressed with new slow version would always compress the same or better. In other words a BWTS version of BZIP2 if one wished could always compress the same or better.
Did you say you were going to write it?![]()
Not exactly. I would need to know just where and how the interface is in his code for BWT and UNBWT. I would also need to see what he gets for some simple rotations of short files. I use short data strings like BANANAS to test for what method is being used. If he does the real original one he will get a different index value for each rotation of the word but the same BWT data set which is the BWTS of ANANASB
most implementations I have seen would get an index of 0 for the BWT of ANANASB which is BNNSAAA which is the same as the BWTS
so I use his old method if index not zero when uncompressing and since BWTS would in this case always return 0 I would use the UNBWTS for this case. Its possible the BWT method he uses returns the single word case for a lyndon word rotated one character to right of left. In which case I would rotate the BWTS one character right of left to remain comparable. Look I moded BWTMIX to use BWTS it can't be that hard unless he is using some oddball implementation for BWT.
This could be used when the next person rewrites a new BZIPX program. He might decided if he understands both to actually start with the BWTS instead of some form of BWT. I realy think if BWTS was invented first no one would be amazed by or even use the BWT form since it add extra data to a file and the fact is UNBWT is not even valid for most data sets.
Look I am an old programmer not exactly sure how all routines work together MAKE is not my thing. But if your interested we could work together on this. The first step would be actually to see if using his source code and simple compiles can one actually build an executable that works. When I played with paq the hardest part what see how it flowed since I am more use to straight coding. But once I was able to actually get it to compile and run. Changing it was easy.
I would look at this as more of a test project. I would expect that the BWTS version might be longer since carrying around an index that has to be written to a file and never used. Its more of a proof of concept. Also since all the after stages have been
tuned with his version of a BWT it might be less tuned to a real BWTS version.
The only real value would be to show what BWT should have been when first invented instead of adding a useless index. Even you think of it as several short blocks based in the lyndon words. But its not, suppose there are 100 lyndon words in a large file spaced evenly. If you are taking BWT of something that has a bunch of " THE" you would expect that the SPACES would bunch together because of the " THE" every were in the file. Well that also happens in the BWTS the spaces from the " THE" will sort to same area of output file regardless of which lyndon word the string came from. But yes one of he lydon word boundaries could cut that string but not often.
When the next programmer comes along and writes BZIPX if he understands the difference between the many versions of BWT and BWTS he may decide to use BWTS instead. You have to admit that if BWTS was invented first no one would use BWT
I am not so sure because people care more about compatibility, speed, and compression ratio. BWTS saves a 4 byte index, but is it faster? It might be more interesting to try BWTS in a newer compressor like BSC or BBB. BSC is very fast, and BBB uses 1.25n memory which could be adapted to BWTS I think. Both are open source and compress much smaller than BZIP2.
Look I start and stop several things. But reach a point of boredom and then stop. I realize looking at BZIP2 code that much of it is familiar. And yes I tested some modified versions and it does seem to do the same transform for blocks that are a single lyndon word so that it would be possible to drop BWTS in the code and create a version that would compress the same or better than current BZIP2 while maintaining compatablity with the current version of BZIP2. What one could do is compress each block using BOTH his BWT version and the BWTS version when BWT shorter use it. When BWTS version shorter use it with the index value set to zero. When BWT returns index of zero its the same as BWTS so no problem except it would be SLOW and would not save much space at all.
However what I really want at this point is a standalone bijective compressor using BWTS and I do not have one on the net. I want to write one similar to BZIP2 but I first want to focus on the binary BWTS. I will write a simple ( for a binary 2 state ) version first. And will compress to a string of Fibonacci numbers first. I will use a modifed Yuta BWTS and UNBWTS since I like his coding style. I want to put first version on this site in a few days. It may only work for one block of a million bytes at first. But I will evolve it to use more than one block and an bijective entropy coder as last stage. When I get bored with this will move on to the 256 state BWTS and maybe higher. Look at will not be super fast. The first version will use my bit input output routines and only and the 2 state BWTS followed by a converting to the Universal numbers. Of courst it will be all bijective. Its a small start and I hope I can keep my word.
Some day I might mod BBB but I hope others get curious enough about BWTS to do it so I don't have to.
I don't think it is possible to make bzip2 BWTS and keep compatibility. I would just write a new compressor.
BBB is far from optimal. I wrote it to illustrate low memory BWT. A better compressor would replace quicksort() with libDivSufSort or MSufSort. Of course you could do this with BWTS if you divide the input into Lyndon words first. The context model could also be made faster by coding run lengths.
I like you but it is possible to make BZIP2 use BWTS and still be able to uncompress files created with regular BWT version of BZIP2. The reason is that the index when ZERO the BWTS and BWT the same. When you compress a block with BWTS just
set the index to zero all the time.
So on when you uncompress you check the index if ZERO use UNBWTS for that block since if made with BWT it would get same answer. If the index not ZERO you use the way that is currently in BZIP2. So it would work. Or I am not sure what you mean by incompatible.
If I could get paid by your company as a test project I would have higher motivation. They could use it as a test of my ability to do something. The pay could be ZERO if it fails to work. But since I would write all blocks using BWTS and still have the zero index It would most likely not compress as well as the current code. I would not want to code where I select which of the two to use and then use the best for each block because that would be a lot harder and very slow.
@encode
at the beginning of this tread you have compiled bzip 1.0.6
can you please try to compile a windows binary of the parallel implementation of bzip2
from http://compression.ca/pbzip2/ ?
the actually version is v1.1.5 (Jul 16, 2011)
(The output of this version is fully compatible with bzip2 v1.0.2 or newer.)
best regards
What about incorporating Mori's patch for bzip2 (which generally replaces Seward's sorting algorithm with libdivsufsort) into pbzip2?