encode:
Did you try verifying decompression? I think that should detect most of errors.
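A cheap safeguard along these lines is a round-trip check: compress, immediately decompress, and compare digests. A minimal sketch in Python, using zlib purely as a stand-in for the compressor under test:

```python
import hashlib
import zlib

def verify_roundtrip(data, compress, decompress):
    # Compress, immediately decompress, and compare digests; a mismatch
    # catches most corruption (e.g. from an unstable, overclocked GPU).
    packed = compress(data)
    restored = decompress(packed)
    return hashlib.sha1(restored).digest() == hashlib.sha1(data).digest()

# zlib here is only a placeholder for the real codec being verified
assert verify_roundtrip(b"enwik9 sample " * 1000, zlib.compress, zlib.decompress)
```

The check roughly doubles compression time, which is why compressors usually store a checksum instead of re-decompressing every block.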
I've heard that OCed GPUs tend to make many calculation errors. Encode, did you try to decompress the file with the weird size?
ADDED: Good timing.
One of the error messages with an overclocked GPU:
Code:
I:\>bsc e enwik9 enwik9.z -b32p -m8f
This is bsc, Block Sorting Compressor. Version 3.0.0. 26 August 2011.
Copyright (c) 2009-2011 Ilya Grebnov <Ilya.Grebnov@gmail.com>.
Compressing enwik9(67%)[D:/Development/Tools/back40computing\b40c/util/spine.cuh, 112] Spine cudaFree d_spine failed: (CUDA error 30: unknown error)
General GPU failure, please contact the author!
Randomly, under an overclocked GPU, BSC produces a wrong compressed file. If you try to decompress that file, it takes a longer time, but BSC happily decompresses it!
Code:
I:\>bsc d enwik9.z e9
This is bsc, Block Sorting Compressor. Version 3.0.0. 26 August 2011.
Copyright (c) 2009-2011 Ilya Grebnov <Ilya.Grebnov@gmail.com>.
enwik9.z decompressed 196399168 into 1000000000 in 24.820 seconds.
Files after decompression:
2996E86FB978F93CCA8F566CC56998923E7FE581 *ENWIK9 (Original)
ECFC9019ACA1077E22527206FB1B7ED3CA0397AC *e9 (Decompressed)
- CRC-32 calculation routine for data integrity verification.
WTF???
Your GPU is unstable. Nothing special. Lower the clock or increase the voltage.
I'm saying that BSC should check data integrity, as listed in its features. And here the decompressed file is not equal to the original (as confirmed by the checksums above).
Just informing the BSC author about a possible bug, to make BSC/LIBBSC more stable and better.
bsc computes an Adler-32 checksum of the compressed data. I will fix this issue in the next version.
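For reference, Adler-32 can be computed incrementally over block data. This sketch uses Python's zlib binding; how bsc actually stores and checks its checksum is not shown here:

```python
import zlib

def adler32_stream(chunks):
    # Incremental Adler-32: feed each chunk, carrying the running value.
    value = 1  # Adler-32 starts at 1, unlike CRC-32's 0
    for chunk in chunks:
        value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF

data = b"example block data"
# chunked and one-shot computations agree
assert adler32_stream([data[:7], data[7:]]) == (zlib.adler32(data) & 0xFFFFFFFF)
```

Note that checksumming the compressed stream detects corruption of the archive itself, but not a compressor that silently produced a wrong (yet internally consistent) compressed file, which is why the checksum must be taken over data the decompressor can re-verify.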
After reviewing the CUDA forum I found a lot of issues with overclocked GPUs. A GPU can work fine in 3D with only some visual artifacts, but still produce incorrect results with CUDA. Could you please test your PC using OCCT v3.1.0? It has a test for CUDA.
Enjoy coding, enjoy life!
Not a CUDA programmer myself, but maybe you could introduce two new experimental modes:
- safe mode with DMR (dual modular redundancy)
- overclocker's mode with TMR (triple modular redundancy)
Redundant copies of (at least some, presumably the most GPU-intensive) operations would be performed in parallel to:
- discover GPU problems and stop compression if the results of the redundant operations aren't identical - safe mode
- allow "voting" in case of non-identical results if two of them are the same (three different results obviously mean an error) - overclocker's mode
Well, it may be huge overkill (and any gain from using the GPU would be rather lost). Compression is not an example of a safety-critical task.
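The voting idea can be sketched in a few lines; `run_kernel` below is a hypothetical stand-in for whatever GPU step would be made redundant, not part of bsc:

```python
def vote3(a, b, c):
    # TMR voting: accept any result that at least two of three
    # independent runs agree on; three distinct results mean an
    # unrecoverable error (overclocker's mode).
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("all three results differ - hardware unstable?")

def run_with_tmr(run_kernel, block):
    # run_kernel is a hypothetical GPU computation, executed three times
    return vote3(run_kernel(block), run_kernel(block), run_kernel(block))

assert vote3("x", "y", "x") == "x"
```

DMR (safe mode) is the two-run variant: compare the pair and abort on any mismatch, trading a 2x slowdown for early detection instead of correction.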
Or check if decompression is correct.
GRZipII MTF model:
Code:
qlfc : enwik8 | 100000000 byte | 22023940(22.0%) byte | 1.747 sec
qlfc : mozilla | 51220480 byte | 16631752(32.5%) byte | 1.747 sec
qlfc : webster | 41458703 byte | 6745872(16.3%) byte | 0.546 sec
qlfc : nci | 33553445 byte | 1235898(03.7%) byte | 0.125 sec
qlfc : samba | 21606400 byte | 4063574(18.8%) byte | 0.484 sec
qlfc : dickens | 10192446 byte | 2370823(23.3%) byte | 0.203 sec
qlfc : osdb | 10085684 byte | 2311308(22.9%) byte | 0.327 sec
qlfc : mr | 9970564 byte | 2287976(22.9%) byte | 0.219 sec
qlfc : x-ray | 8474240 byte | 3872735(45.7%) byte | 0.500 sec
qlfc : sao | 7251944 byte | 4869905(67.2%) byte | 0.608 sec
qlfc : reymont | 6627202 byte | 1024072(15.5%) byte | 0.078 sec
qlfc : ooffice | 6152192 byte | 2703163(43.9%) byte | 0.250 sec
qlfc : xml | 5345280 byte | 392929(07.4%) byte | 0.032 sec
qlfc : interrup | 5134954 byte | 818903(15.9%) byte | 0.062 sec
qlfc : enwik5 | 5000000 byte | 1266655(25.3%) byte | 0.093 sec
qlfc : book1 | 768771 byte | 221223(28.8%) byte | 0.016 sec
Global ratio: 4.12033216, Global time: 7.037 sec

bsc fast QLFC model:
Code:
qlfc : enwik8 | 100000000 byte | 20928559(20.9%) byte | 2.324 sec
qlfc : mozilla | 51220480 byte | 16149628(31.5%) byte | 1.935 sec
qlfc : webster | 41458703 byte | 6478039(15.6%) byte | 0.859 sec
qlfc : nci | 33553445 byte | 1219420(03.6%) byte | 0.218 sec
qlfc : samba | 21606400 byte | 3991791(18.5%) byte | 0.499 sec
qlfc : dickens | 10192446 byte | 2270321(22.3%) byte | 0.312 sec
qlfc : osdb | 10085684 byte | 2254247(22.4%) byte | 0.297 sec
qlfc : mr | 9970564 byte | 2221703(22.3%) byte | 0.312 sec
qlfc : x-ray | 8474240 byte | 3799108(44.8%) byte | 0.499 sec
qlfc : sao | 7251944 byte | 4718048(65.1%) byte | 0.593 sec
qlfc : reymont | 6627202 byte | 991200(15.0%) byte | 0.125 sec
qlfc : ooffice | 6152192 byte | 2585650(42.0%) byte | 0.374 sec
qlfc : xml | 5345280 byte | 386820(07.2%) byte | 0.047 sec
qlfc : interrup | 5134954 byte | 795626(15.5%) byte | 0.140 sec
qlfc : enwik5 | 5000000 byte | 1216376(24.3%) byte | 0.141 sec
qlfc : book1 | 768771 byte | 213966(27.8%) byte | 0.031 sec
Global ratio: 3.98866593, Global time: 8.706 sec

Do we need a fast MTF model in bsc?
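For readers unfamiliar with the model being discussed: move-to-front replaces each symbol by its index in a self-organizing list, so BWT output (long runs of repeated bytes) turns into runs of small numbers. A minimal sketch; real MTF coders add run-length and entropy coding stages on top:

```python
def mtf_encode(data):
    # Each byte is replaced by its current position in the alphabet,
    # then moved to the front; recently seen bytes get small indices.
    alphabet = list(range(256))
    out = []
    for b in data:
        i = alphabet.index(b)
        out.append(i)
        alphabet.pop(i)
        alphabet.insert(0, b)
    return out

def mtf_decode(indices):
    # Exact mirror: look up the byte at each index and move it to front.
    alphabet = list(range(256))
    out = bytearray()
    for i in indices:
        b = alphabet.pop(i)
        out.append(b)
        alphabet.insert(0, b)
    return bytes(out)

assert mtf_decode(mtf_encode(b"bananaaa")) == b"bananaaa"
```

QLFC, by contrast, models the rank statistics directly with an adaptive coder, which is why it compresses better at some cost in speed.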
Enjoy coding, enjoy life!
IMO, we don't need MTF. As you summarized already, the speed impact is not that big, while the compression ratio is much worse. Also, as you know, the current trend is multi-threaded apps, and you have done well there already. I think those facts settle the question.
BIT Archiver homepage: www.osmanturan.com
Why don't you try making a direct CM (maybe RLE+CM, or order-1 Huffman + CM) instead?
Also, how about making a postcoder that doesn't need the whole block to start working?
And MTF is not really a fast algorithm anyway - I'd say it would be better to further optimize the QLFC speed if you need higher speed.
@gribok:
15th March 2011 00:12 you wrote:
---
I usually work in Visual Studio, and for bsc 2.4.5 I used a combination of VS 2008 SP1 + Intel 11.1. Now I have upgraded my PC to VS 2010 SP1 and Intel Composer XE, and the new builds of bsc are slower.
---
can you please try to compile with the new "beta" gcc 4.6 ?
http://sourceforge.net/projects/ming...4/gcc-4.6.1-2/
http://sourceforge.net/projects/ming....lzma/download
info is from
http://encode.su/threads/1368-Some-o...6463#post26463
gcc 4.6 has new optimization/support for
Intel Core 2 : -march=core2 and -mtune=core2
Intel Core i3/i5/i7 : -march=corei7 and -mtune=corei7
Intel Core i3/i5/i7 processors with AVX : -march=corei7-avx and -mtune=corei7-avx
maybe it can give bsc 3.0 a little bit of extra speed on Core2 / Corei7?!
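The flags above would be applied along these lines; the file names here are an assumption for illustration, not bsc's actual build script:

```
# hypothetical compile line; adjust file names to the real source tree
g++ -O3 -fopenmp -march=corei7 -mtune=corei7 bsc.cpp libbsc/*.cpp -o bsc
```

With gcc, -march selects the instruction set (the binary then requires that CPU), while -mtune only adjusts scheduling and keeps the binary portable.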
best regards
ps:
if you think there is a need for very fast compression
- why not implement a ppmd - algorithm within bsc ?
(7zip seems to have a very fast and good implementation of PPMD = Dmitry Shkarin's PPMdH with small changes)
Last edited by joerg; 29th September 2011 at 10:05.
1. 7z actually includes both ppmd vH (for rar) and vI (for 7z, maybe for zipx).
2. Afaik they are not "implementations", but simple ports of the original ppmd sources.
3. The latest version is vJ, which is better.
4. There's no sense in integrating ppmd into bsc, because there's basically nothing that ppmd does better than BWT.
yesterday nvidia released the new CUDA 4.1 toolkit:
http://www.developer.nvidia.com/cuda-toolkit-41
@gribok:
will you build a new version of your wonderful bsc 3.0.0 with the new CUDA 4.1-tools ?
I am especially interested in a speedup for the modes "Sort Transform of order 5" and "Sort Transform of order 6"
blazerx wrote on 7th December 2011, 14:01:
http://encode.su/threads/1208-CUDA-G...ll=1#post27528
***
For optimal performance on nVIDIA CUDA enabled cards please specify the block switch to No. of Shaders /8 and threads to 512
for example:
GTX460 has 336 CUDA cores - Blocks=42 ..... GTX580 has 512 CUDA cores - Blocks=64
This should allow you to reach maximum performance and shave off some time to the default values.
***
i can't quite understand this: how and where is it possible to specify this block switch?
best regards
@gribok: thanks in advance - sounds wonderful
as time goes by, bsc becomes a more and more brilliant program. but there are a few things that have potential for improvement:
1. afair, when i tested grzip on my old q6600@3.24GHz, it performed the ST4 transform of a 100 MB block in 0.5 seconds, and the further MTF compression took 1.5 seconds. if the numbers are still about the same, it would be fantastic to see some super-fast encoding algorithm, even if it makes compression 10-20% worse - it would just extend bsc usage to new areas
2. there is a well-known technique of Huffman-preprocessing data before BWT/ST. afaik, it just decreases the number of memory accesses, improving the speed/compression ratio
3. bzip2 had an -s switch that decreased the amount of memory used for compression and decompression, and BBB is also very memory-efficient. for FreeArc it's important to be able to decompress using a minimum of memory: modern computers usually have 2 GB of memory, but i want to keep archives decompressible even on computers with 128 MB RAM. now this means that i should limit the block size to 20-30 MB, and implementation of such an option would allow me to use 50-100 MB blocks. the same holds for advertising bsc as a bzip2 replacement
Likely. I have a prototype with the grzip model in bsc.
Unlikely, because I am using external libraries for the BWT.
Unlikely, because in bzip2 the block is limited to 1MB, so you can pack bits and do other optimizations for low memory. For blocks >16MB this is not trivial.
Enjoy coding, enjoy life!
3. it's also implemented in the open-source BBB, with a full description
1. hopefully it will be much faster than the MTF results you cited above. but overall, maybe there are some other methods for fast encoding? what needs the most time in the MTF code? MTF itself, or Huffman, or something else? we just need a way to do this part faster, even with worse quality
Last edited by Bulat Ziganshin; 17th February 2012 at 22:56.
1. can you please say what requires the most time - MTF, Huffman or something else?
3. bbb (Aug. 31, 2006) is a Big Block BWT (Burrows-Wheeler transform) compressor. It allows blocks as large as 80% of available memory...
bbb uses a memory efficient BWT. For compression, blocks are first context-sorted in small blocks and then merged using temporary files. For the inverse transform, instead of building a linked list, the program builds an index to the approximate location of the next node, then searches linearly for the exact location.
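For comparison, the conventional inverse transform that BBB avoids looks like this sketch: build the LF-mapping once, then follow it n times. The random memory accesses of this walk are what the linked list (and BBB's approximate index) is about:

```python
def inverse_bwt(bwt, primary):
    # bwt: last column of the sorted rotations; primary: row of the original.
    counts, ranks = {}, []
    for c in bwt:                      # rank of each char among its equals
        ranks.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1
    first, total = {}, 0
    for c in sorted(counts):           # start of each char's run in column F
        first[c] = total
        total += counts[c]
    out, i = [], primary
    for _ in range(len(bwt)):          # walk the text backwards via LF-mapping
        c = bwt[i]
        out.append(c)
        i = first[c] + ranks[i]
    return ''.join(reversed(out))

# "banana": sorted rotations give last column "nnbaaa", original at row 3
assert inverse_bwt("nnbaaa", 3) == "banana"
```

The `ranks` array here plays the role of the explicit linked list: it costs 4-8 bytes per input byte, which is exactly the memory BBB's approximate-index-plus-linear-search trades away for speed.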
Last edited by Bulat Ziganshin; 18th February 2012 at 10:20.
Right. Then BBB models the BWT output by mixing order 1, 2, 4 indirect contexts. Problem with BBB is that it uses a naive sort that runs too slow on highly redundant input. I could fix that using divsufsort to sort the small blocks before merging. But instead I stopped supporting BBB and put my efforts into ZPAQ instead. zpaq -m2 uses divsufsort BWT followed by order 1-2 ISSE chain for better speed and similar compression to BBB. Blocks are also compressed and decompressed in parallel by separate threads. I did not implement the large memory model but Jan Ondrus has written a config file and preprocessor for 1 GB blocks (using 1.4 GB memory) which is posted on the ZPAQ page.
some bzip2 mistakes to avoid
http://www.bzip.org/1.0.5/bzip2-manu....5.html#limits
Does anyone have a link to the bsc v2.80 binaries ? (the latest without CUDA) ?
I would like to test the ST6-7-8 modes with and without CUDA acceleration, to see the difference.
The latest version (v3.0.0) seems to always use CUDA for the ST7-8 modes; not sure about ST6 though.
Further, is there a way to limit the number of blocks that are processed in parallel? On my 6+6HT cores = 12 threads, BSC allocates about 13GB RAM if I use 250MB blocks...
I see only the -t switch, but this completely disables parallel block processing and slows things down a lot.
Thanks in advance !
You can download old builds using a direct link like this: http://libbsc.web.officelive.com/Doc...-2.8.0-src.zip bsc implements ST7&8 on the GPU only. ST5&6 have both GPU and CPU implementations. You can use the -G switch to control this behavior. Currently there is no switch to control the number of blocks running in parallel. I will probably add it in the next release.
Enjoy coding, enjoy life!
BSC for me is very nice compressor!
Good job !
Regards! Francesco!
Gribok,
It seems the link you gave has only the sources for the older builds, not the compiled binaries....
http://libbsc.web.officelive.com/Doc.../bsc-2.8.0.zip gave me : 404 not found
I've tried with Firefox and Opera, same results.