Page 2 of 2 FirstFirst 12
Results 31 to 56 of 56

Thread: qad, new experimental BWT(qlfc) based compressor

  1. #31
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Quote Originally Posted by Matt Mahoney View Post
    qad 1.0 still gives me a 0 size compressed file in both 32 bit Windows and 64 bit Linux under Wine 1.6. The 32 and 64 bit Linux versions still complain it can't find libdivsufsort.so.3. I tried:

    qad cf enwik8 enwik8.qad
    qad c enwik8 enwik8.qad
    qad c2 enwik8 enwik8.qad
    And does program write out anything, or just exists? I have added there now some messages, and I was testing it under WinXP 32bit( not on such big file though, but both versions seem to have worked ), or with Wine 1.4. Also the Linux binary is now linked with libdivsufsort.

    And I made quick fix for fast mode decode, as decoded file was twice as large in previous version and added a bit history into contexts for the RC (for last 7 bits), the difference is quite small, it's not properly weighted now.
    Attached Files Attached Files
    Last edited by quadro; 24th June 2014 at 09:36.

  2. #32
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    The error message helps. (T3200, 2.0 GHz, 3 GB, Vista 32 bit)

    Code:
    C:\tmp\qad1.01\32bit>qad.exe c \res\enwik8 enwik8.qad
    
     Compressing (\res\enwik8 -> enwik8.qad)..
     Not enough free memory
    After more experiments, it seems the largest file qad can compress is 31 MB (prefix of enwik8 ). 32 MB runs out of memory. Memory usage in task manager shows about 400 to 560 MB depending on file size when it works.
    Last edited by Matt Mahoney; 24th June 2014 at 20:24.

  3. #33
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Thanks, I think I found problem. The solution is only partial for now though, there is a danger of buffer overflow on highly repetitive data, and I need to find a fix for this, other than to increase limit for sub-contexts .

    I allocated quite big buffer for statistics and didn't realized it will be so big, its more than 1GB overall, so I decreased this limit ten times, this buffer should drop accordingly. Also BWT computation buffer is reused now, and the memory req. should decrease overall, but it's still above 5n due to statistics, which have fixed size. I may decrease it, but it will probably slow down encoding.
    Attached Files Attached Files
    Last edited by quadro; 24th June 2014 at 23:45.

  4. #34
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Another maintenance release. I think I now handled the buffer overflow issue( due insufficient sub-context count ), and it should be possible to store this statistics buffer into BWT computation one. There is still imminence of underflow in the RC(not connected to this), mainly in 64 RC version, which I need to cope with somehow.

    Also I tried to remove a needles buffer rotation of default mode decode function or buffer flushing there, but since this change would be quite large, I only restricted this rotation on one RLE channel (out of two).

    BTW I found this two channel division or later context selection for RLE could work also for the ranks, but it would complicate decoding part for now and brought only a small gains.

    Also I been playing with those bit history predictions and wanted to employ some state counters of the used states, but this didn't worked out, I think it's because these statistics are indirect and those bits could be in fact very distant and hence may not add on precision. But maybe shortening of the history buffer could help for higher contexts(ranks), or to use some extra order 0+ statistics.
    Attached Files Attached Files
    Last edited by quadro; 25th June 2014 at 15:34.

  5. #35
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Just tried a very simple mix of 2 consecutive contexts. In contexts A and B:
    Code:
    So e.g. in context A:
    if( sA == 0 && sB ==0 ) cnt[ 0 ]++;
    if( sA == 0 && sB ==1 ) cnt[ 1 ]++;
    if( sA == 1 && sB ==0 ) cnt[ 2 ]++;
    if( sA == 1 && sB ==1 ) cnt[ 3 ]++;
    Where sA is the last coded bit of context A, sB last coded bit in context B and cnt is the corresponding counter.

    Code:
    Now e.g., if( ( cnt[ 2 ] / cnt[ 0 ] ) > ( cnt[ 3 ] / cnt[ 1 ] ) )
    Amend cum prob. of context A (from context B)
    Last condition Means if there is a higher chance that coded bit in context A will be 1, when last coded bit of context B was 0, rather than 1. And this seems to work. So I will try some more conditions and to spread it into more contexts.
    Last edited by quadro; 26th June 2014 at 11:35.

  6. #36
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    It seems these counters were possibly my case specific, so I have removed them.

    And as where I code first 3 ranks, as follows:
    rank 1: put 0 into context 2
    rank 2: put 1 into context 2, put 0 into context 3
    rank 3: put 1 into context 2, put 1 into context 3

    ( ranks are usually sorted by the freq. )
    When there goes 1 into context 3, I now try to favorize that there will be 1 in context 2 more,
    than if there was 0 in context 3.

    Gains are quite small, see below, even though I omitted prob. change for sub-contexts, on the same context, they are only for the other context ( each main context I have several subcontexts ) to to ease up weighting.


    More successful it seem I was to improve context selection for RLE.

    Previous formula was to choose RLE channel according rank >= 2nd RLE channel cont. segment length.

    And I changed it into rank + continuous rank segment length (while the RLE goes into 1st channel) >= length of cont. seg. of 2nd RLE channel and if there is any RLE into 1st channel, decrease rank segment length (while RLE goes to 1st channel) by 1.

    Both RLE channels are than coded by different RLE encoding.

    New context selection (enwik8, with 32bit RC cf mode)
    21120991 2 & 3 ctx state mix
    21123033 no mix

    Old context selection
    21132220 2 & 3 ctx state mix
    21134621 no mix

    I would probably not compare gains of this 2 & 3 state mix, sice after new context selection, contexts have a different length, its hard to judge it. Anyway in fast mode, where is no ctx. selection for RLE, mix gains seem better.
    Attached Files Attached Files
    Last edited by quadro; 27th June 2014 at 17:49.

  7. #37
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    I reduced number of main buffer passes in default mode from 8 to 5, also both RLE channels now use mostly the same ( or close ) context, taking more advantage from latest changes. The default mode got somehow closer to fast mode(speed-wise), so I wonder if to keep it, or make some faster.
    Attached Files Attached Files

  8. #38
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Decoder now uses less memory on default mode and also rank contexts were split into two channels, like I did with the RLE before. It uses same formula, but instead of rank + .., there is some file-specific constant, I picked 0.

    Default mode still contain 5 passes (before final encoding), anyway before changing it, i.e. adding more loops for RLE. I think I should focus on better weighting, because current scheme doesn't have to fit every freq. distribution.

    enwik8 64bit RC cf
    21086834
    Attached Files Attached Files

  9. #39
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Another small update.

    I removed the previous constant from rank context selection. So now it uses instead a scaled length of the last continuous RLE segment (/2 because this length doesn't have to be always helpful, but I guess a better weighting should compensate it, i.e if RLE segment is small overall compared to the rest of the ranks), while RLE context selection now uses previous rank, instead of the actual one. Also I tried to play with some averaging, but it wasn't of much help.
    Attached Files Attached Files
    Last edited by quadro; 30th June 2014 at 14:26.

  10. #40
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Made a 64bit version, actually unlike emulated one, it increase the size of firstly compressed context by few bytes, usually 5 and I really don't know why is that, but at least it's faster. Also It's yet not possible to compress a file over 2GB.

    BTW I was trying to increase the number of RLE repetitions to improve compression, but it seems this do not help anymore, anyway I made slight RC tuning just to match better the latest changes.
    Attached Files Attached Files

  11. #41
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    I tried to add order 0 statistics for RLE, by modifying RLE algorithm, and to choose a different context for ranks (or here a consecutive run-lengths) 2-15, anyway the difference seems quite negligible e.g.:

    For fast mode(enwik8 64 bit):
    21199186 (old RLE)
    21198811


    Default mode (normal RLE is quite off-loaded by a Tree structure)
    21073461 (old RLE)
    21073402

    It could be possible to modify Tree structure to more benefit order 0 statistics, but I think I will leave it as it is, probably not worthy the slowdown.

    Anyway it seem possible to improve prediction / context selection according ranks ( or just their metrics, or parts ) to be seen, due present tree structure. So I added a secondary context selection on a joined rank channels, that has been split before. Actually there could be some compression loss because of the split, but maybe such prediction could be applied also on first context selection, also I have not tried it yet for the RLEs part.

    Enwik8 64bit
    21069357 - secondary context selection according ranks to follow
    Attached Files Attached Files
    Last edited by quadro; 3rd July 2014 at 21:48.

  12. #42
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Added this look-ahead prediction also on RLE ( a tree based one ), I think for default bit RLE, it might be quite complicated. Prediction for branches is more approximate than the one for leaves. But maybe to sparsify the branches, while having a whole next rank or a bigger portion of it might help overall.

    Enwik8 cf 64bit emu
    21059192
    Attached Files Attached Files
    Last edited by quadro; 6th July 2014 at 09:14. Reason: Spelling

  13. #43
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Bugfix for the previous version (should be in next release), for a possible wrong state at the and of decoding:


    the line 707:
    - nC4 = nextChar ? DeBinRC( 4, ncdiv = ( ncdiv << 1 ) + ( l2 > ncdiv ) ) : nC4;
    + nC4 = nextChar ? DeBinRC( 4, ncdiv = ( ncdiv << 1 ) + ( l2 > ncdiv ) ) : 0;


    Also I tried to use a different context selection formula for every second byte, to increase this look-ahead prediction precision on context selection I added before, but the difference was only a few bytes and it was not always helpful.

    A better seems combination of a look-ahead prediction with previous history. So I OR-ed these predictions together, to get either 0 or 1 channel. Also it seems it may help to increase the buffer length, i.e use more bytes before or after current position, or to use some better function, i.e an average or so. I will see what will get to next release.
    Last edited by quadro; 7th July 2014 at 21:46.

  14. #44
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    New version is out.

    It brings, among various bugfixes i.e.header/footer parsing, a fix for escape symbol states, memory allocation fix for files with small alphabets.

    * Slightly improved context selection, last character statistics are taken into account as well.
    * Tiny reduction of memory requirements on decoder part.
    * New parameter v shows current program version.
    * Added a shell script for testing. Script 'q' creates a log file q.log, where you can watch online compression results i.e. with tail -f q.log. Script takes same parameters as program for compression, i.e: ./q c bib bib.cmp

    (If on your Linux dist. time command reports maximum resident set (memory use), as fold of system's page size, you can use q_time_fix script instead.)

    For the next release I'd like to add a completely new fast mode perhaps, so most fixes were for modes c and cf so far.
    Attached Files Attached Files

  15. #45
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    Qad 1.11 fails on Game1 test of WCC2015!

  16. #46
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Thanks a lot for the report,

    It seems it is a very big file, if it's single file. Can you give me some more information or a link on those data?

    Compression or decompression fails? Although which method, cf / c, 32 / 64-bit?

    I am on mobile data limit and have quite limited RAM(2GB) only for testing,

    So maybe If you are on Linux, I'd be glad if you could send me output from command, "gdb ./qad" and "r c game1 game1.cmp" or a similar one for windows.

    Best regards
    Last edited by quadro; 26th May 2015 at 23:14.

  17. #47
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    GAME1 is single file TAR (for single file compressor) for 564.412.928 byte.
    PC for test is Intel Core i7 920 2.66ghz 6GB RAM x64 Win7
    Option is: qad c game1.tar output
    Last edited by Nania Francesco; 26th May 2015 at 23:48.

  18. #48
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    I think I spotted a problem in range coder finalization, or the stop symbols. It's a bit more complicated because there is a lot of subcontext and I plan to change it further to increase statistics, it's not yet error prone, but I added a new mode, which outputs raw binary data without entropy coding, so lets see if it can help. Subcontext information is lost and I also stored there some contexts together to gain extra space, so it's not so compressible yet.

    It can be triggered with cb, cfb, or c2b option.

    I done some minor other changes, which I think should not affect RC finalisation so much.

    I.e. ranks 2+ have been split into two channels, according to the first rank, similar to as what I done before with RLE (or the 0 rank), this now follows first rank split. Their subcontexts (in these new channels) are chosen according to up to four previous ranks.

    There is also a bug fix for parameters parsing, if the name of compressed file started with '2' or 'f' characters before, wrong compression mode might've been selected.
    Attached Files Attached Files

  19. Thanks:

    Stephan Busch (2nd June 2015)

  20. #49
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    A little OT, I just wrote a BWT sort, it's mostly only code for trying a different sorting techniques, I will probably not include it into this project unless it will be a bit more stable. It does not handle any special cases yet (i.e. highly redundant data) etc,

    It consist from 3 byte radix sort, than from an optional one byte radix for larger segments. The rest of buckets or segments are handled by quicksort / insert sort. I think insert sort is close to optimal on mostly sorted data, but it still have the worst case scenario usually worse than optimal sorting networks, so maybe some dynamic or conditional blend of these two methods (insert sort, or something like conditional / dynamic sorting network) could bring some 2-3 % improvement for small buckets. I've tested reverting insert sort directions and it seems to help a bit, so it's not as adaptive as can be.

    Memory req. are slightly higher than divsufsort and with 4 cores, it's is about 25% slower than divsufsort (parallel), on not so redundant data, but it should not be a problem to modify it i.e. for 8 cores etc.

    Otherwise it's not really optimised now, and I hardcoded it for 4 cores. Also it may be worthy to use 2 byte radix on really large buckets, or recursion instead 3-byte radix etc etc, also I use there a lot of needless memory moves, I guess there.


    //Edit I changed this into 2 and 2-byte radix (optional), instead 3 & 1 or 3 & 2, it seems to save some memory and it's faster for some files. 3 byte radix seems to have quite long overhead.

    Here is code snippet:
    Code:
    #define RADIX_TBL_SIZE 256 * 256 * 4
    
    int cmpBranch( unsigned int b1, unsigned int b2, unsigned int depth = 2 ) {
        b1 += depth, b2 += depth; // skip (2 or 4 byte) radix sort processed data
        while( ( ( unsigned int *) &data[ b1 ] )[ 0 ] == 
        ( (unsigned int *) &data[ b2 ] )[ 0 ] ) { 
            b1 += 4, b2 += 4;
            if( b1 >= fileSize + 3 ) b1 -= fileSize;
            if( b2 >= fileSize + 3 ) b2 -= fileSize;
        }
        unsigned int p1 = ( ( unsigned int *) &data[ b1 ] )[ 0 ], 
                     p2 = ( ( unsigned int *) &data[ b2 ] )[ 0 ];
        return __bswap_32( p1 ) < __bswap_32( p2 ); // __constant_cpu_to_be32 endian conversion
    }
    
    
    void insertionSort( unsigned int * idx, unsigned int l, unsigned int p, 
                                                            unsigned int depth = 2 ) {
        unsigned int j, swp, i;
        for( unsigned int i = l + 1; i <= p; i++ )
        {
            j = i;
            while( j > l && cmpBranch( idx[ j ], idx[ j - 1 ], depth ) ) {
                swp = idx[ j - 1 ]; // top-down addition probably faster
                idx[ j - 1 ] = idx[ j ];
                idx[ j ] = swp;
                j--;
            }
        }
    }
    
    
    void quickSort( unsigned int * idx, unsigned int l, unsigned int p, unsigned int de = 2 ) {
        if( p - l <= 7 ) { insertionSort( idx, l, p ); return; }
        unsigned int lf = l, ri = p, mid = idx[ ( l + p ) >> 1 ], swp;
        do {
            while( ( idx[ lf ] != mid ) &&
                ( 1 == cmpBranch( idx[ lf ], mid, de /*depth of already sorted data*/ ) ) ) lf++;
            while( ( idx[ ri ] != mid ) &&
                ( 0 == cmpBranch( idx[ ri ], mid, de ) ) ) ri--; // { if( p - l == 1 ) return; ri--; }
            if( (lf <= ri) ) { swp = idx[ lf ];
                idx[ lf ] = idx[ ri ];
                idx[ ri ] = swp; lf += 1, ri -= 1; }
        } while( lf < ri );
          if( l < ri ) quickSort( idx, l, ri );
        if( lf < p ) quickSort( idx, lf, p );
    }
    
    
    #define BUCKET_QSORT( coreN, from, to ) \
        unsigned int * newIndex = (unsigned int *) malloc( maxSeg[ coreN ] * sizeof( unsigned int ) ), \
                     * cnt = (unsigned int *) malloc( RADIX_TBL_SIZE ); \
        for( int b = from; b < to; b += sizeof( unsigned int ) ) { \
            while( !( ( unsigned long *) &rdxSrt02[ b ] )[ 0 ] ) { if( b + 8 >= to ) break; b += 8; } \
            if( freq = ( ( ( unsigned int *) &rdxSrt02[ b ] )[ 0 ] ) ) { \
                freq -= lstFreq; \
                if( freq > 1 ) { \
                    if( freq > ( maxSeg[ coreN ] >> 8 ) ) { \
                        memset( cnt, 0, RADIX_TBL_SIZE ); \
                        for ( int r = 0; r < freq; r++ ) \
                            cnt[ __bswap_16( ( ( unsigned short *) &data[ bq[ lstFreq + r ] + 2 ])[ 0 ] ) ]++; \
                        int ttlFreq = 0, lfreq; \
                        for ( int r = 0; r < 256 * 256; r++ ) \
                            lfreq = cnt[ r ], cnt[ r ] = ttlFreq, ttlFreq += lfreq; \
                        for ( int r = 0; r < freq; r++ ) \
                            newIndex[ cnt[ __bswap_16( ( ( unsigned short *) \
                            &data[ bq[ lstFreq + r ] + 2 ] )[ 0 ] ) ]++ ] = bq[ lstFreq + r ]; \
                        lfreq = 0; \
                        for ( int r = 0; r < 256 * 256; r++ ) { \
                            if( cnt[ r ] - lfreq > 1 ) \
                                quickSort( newIndex, lfreq, cnt[ r ] - 1, 4 /*depth*/ ); \
                            lfreq = cnt[ r ]; \
                        }\
                        memcpy( bq + lstFreq, newIndex, freq << 2 ); \
                    }    else quickSort( bq, lstFreq, lstFreq + freq - 1 ); \
                } \
                lstFreq = freq + lstFreq; \
            } \
        } \
        free( newIndex ), free( cnt ); \
    
        for( long l = 0; l <= 6; l++ ) {  // trailing characters for beter alignment
            data[ size + l ] = data[ l ]; // (for BWT data buffer)
        }
    
        unsigned char * rdxSrt02 = ( unsigned char *) malloc( RADIX_TBL_SIZE );
        unsigned int * rdxSrt = ( unsigned int *) rdxSrt02, * bq = ( unsigned int *) bQlfc;
    
        memset( rdxSrt02, 0, RADIX_TBL_SIZE );
        #pragma omp parallel for
        for( unsigned int c = 0; c < fileSize; c++ ) {
            #pragma omp atomic
            rdxSrt[ __bswap_16( ( (unsigned short *) &data[ t ] )[ 0 ] ) ]++;
        }
    
        // prepare parallel sections borders and statistics for the 2 byte radix sort
        unsigned int freq, totalFreq = 0, parSec[ 4 ][ 2 ], maxSeg[ 4 ] = { 0 };
        memset( parSec, -1, 4 * 2 * sizeof( unsigned int ) );
        for(unsigned int j = 0; j < RADIX_TBL_SIZE; j += 4) {
            while( !( (unsigned long *) &rdxSrt02[ j ] )[ 0 ] ) { if( j + 8 >= RADIX_TBL_SIZE ) break; j += 8; }
            if( freq = ( (unsigned int *) &rdxSrt02[ j ] )[ 0 ] ) {
                ( (unsigned int *) &rdxSrt02[ j ] )[ 0 ] = totalFreq;
                if( totalFreq > ( ( fileSize >> 2 ) * 3 ) ) {
                    if( totalFreq < parSec[ 2 ][ 1 ] ) parSec[ 2 ][ 1 ] = totalFreq, parSec[ 2 ][ 0 ] = j;
                    if( freq > maxSeg[ 3 ] ) maxSeg[ 3 ] = freq
                } else if( totalFreq > ( fileSize >> 1 ) ) {
                    if( totalFreq < parSec[ 1 ][ 1 ] ) parSec[ 1 ][ 1 ] = totalFreq, parSec[ 1 ][ 0 ] = j;
                    if( freq > maxSeg[ 2 ] ) maxSeg[ 2 ] = freq;
                } else if( totalFreq > ( fileSize >> 2 ) ) {
                    if( totalFreq < parSec[ 0 ][ 1 ] ) parSec[ 0 ][ 1 ] = totalFreq, parSec[ 0 ][ 0 ] = j;
                    if( freq > maxSeg[ 1 ] ) maxSeg[ 1 ] = freq;
                } else if( freq > maxSeg[ 0 ] ) maxSeg[ 0 ] = freq;
                totalFreq += freq;
            }
        }
        // failed to split symbol distribution on sections for parallel sorting, 1 thread not optimal
        if( parSec[ 2 ][ 1 ] == -1 ) parSec[ 2 ][ 1 ] = fileSize, parSec[ 2 ][ 0 ] = RADIX_TBL_SIZE;
        for( int s = 0; s <= 1; s++ ) // for all sections
            if( parSec[ s ][ 1 ] == -1 )
                parSec[ s ][ 1 ] = parSec[ 2 ][ 1 ],  parSec[ s ][ 0 ] = RADIX_TBL_SIZE;
    
        // fill rank sort indexes on positions in out. buffer
        for( unsigned int t = 0; t < fileSize; t++ ) {
            bq[ rdxSrt[ __bswap_16( ( ( unsigned short *) &data[ t ] )[ 0 ] ) ]++ ] = t;
        }
        // quick sorts running in paralel on sections of unsorted data after 3-byte radix sort
        #pragma omp parallel sections
        {
            #pragma omp section
            {
                unsigned int lstFreq = 0, freq;
                BUCKET_QSORT( 0, 0, parSec[ 0 ][ 0 ] )
            }
            #pragma omp section
            {
                unsigned int lstFreq = parSec[ 0 ][ 1 ], freq;
                BUCKET_QSORT( 1, parSec[ 0 ][ 0 ], parSec[ 1 ][ 0 ] )
            }
            #pragma omp section
            {
                unsigned int lstFreq = parSec[ 1 ][ 1 ], freq;
                BUCKET_QSORT( 2, parSec[ 1 ][ 0 ], parSec[ 2 ][ 0 ] )
            }
            #pragma omp section
            {
                unsigned int lstFreq = parSec[ 2 ][ 1 ], freq;
                BUCKET_QSORT( 3, parSec[ 2 ][ 0 ], RADIX_TBL_SIZE )
            }
        } //sections
    
        { // change sorted indexes into sorted data
            unsigned int * dataSrtIdx = ( unsigned int *) bQlfc;
            #pragma omp parallel for
            for( unsigned int i = 0; i < fileSize; i++ ) {
                if( !dataSrtIdx[ i ] ) bwtIndex = i;
                dataSrtIdx[ i ] = data[ ( dataSrtIdx[ i ] + fileSize - 1 ) % fileSize ];
            }
            #pragma omp parallel for
            for( unsigned int i = 0; i < fileSize; i++ ) {
                data[ i ] = dataSrtIdx[ i ];
            }
        }
    
    
    inverse transform( forward / backward ):
    
            unsigned char * idxs = ( unsigned char *) malloc( fileSize << 2 );
            unsigned int * id = ( unsigned int *) idxs;
            if( idxs == NULL ) cout << "\n Not enough free memory", exit( 1 );
    
            //for each symbol to be decoded, sf and sf2 - symbol freq, relative or cumulative, (already stored in archive)
            id[ OutSize ] = sf[ decChar ] - sf2[ decChar ]; // inicialise inverse BWT table, backward tranform
            //bQlfc[ sf[ decChar ] ] = decChar, id[ sf[ decChar ] ] = OutSize++; // inicialise inverse BWT, forward tranform
    
            // inverse BWT backward
            for( unsigned int c = 0; c < OutSize; c++ ) {
                decode[ OutSize - c - 1 ] = bQlfc[ bwtIndex ]; // print decoded data / characters
                bwtIndex = sf2[ bQlfc[ bwtIndex ] ] + ( id[ bwtIndex ] /*& 0x00FFFFFF*/ ); // next character's index
            }
            //    inverse BWT forward
            //for( unsigned int c = 0; c < OutSize; c++ )  // print decoded data / character
            //    decode[ c ] = bQlfc[ bwtIndex ], bwtIndex = id[ bwtIndex ];
    Last edited by quadro; 9th June 2015 at 01:59. Reason: fixed a mistake

  21. #50
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Added a simple recursion. Each core splits the larger buckets by a 2-byte radix sort, until a certain depth is reached.

    I compared this with divsufsort on enwik8. Parallel divs. finished in 59s while this in 65s. Anyway it could be possibly improvement only for the given file.

    Memory allocation is done dynamically for each new pair of columns, to fit bucket size, which saves some memory, but is not probably that fast. Also every sorted part is recursively copied into all the buffers below, which is possibly not fastest too.

    Code:
    void rBuckQSort( unsigned int freq, unsigned int * prevIndex, unsigned int depth = 2 ) {
    
        unsigned int * newIndex = (unsigned int *) malloc( freq * sizeof( unsigned int ) ),
                     * cnt = (unsigned int *) malloc( RADIX_TBL_SIZE );
    
    
        memset( cnt, 0, RADIX_TBL_SIZE );
        for ( int r = 0; r < freq; r++ )
            cnt[ __bswap_16( ( (unsigned short *) &data[ prevIndex[ r ] + depth ])[ 0 ] ) ]++;
        int ttlfreq = 0, lfreq;
        for ( int r = 0; r < RADIX_TBL_SIZE / sizeof( unsigned int ); r++ )
            lfreq = cnt[ r ], cnt[ r ] = ttlfreq, ttlfreq += lfreq;
        for ( int r = 0; r < freq; r++ )
            newIndex[ cnt[ __bswap_16( ( ( unsigned short *)
            &data[ prevIndex[ r ] + depth ] )[ 0 ] ) ]++ ] = prevIndex[ r ];
        lfreq = 0;
        for ( int r = 0; r < RADIX_TBL_SIZE / sizeof( unsigned int ); r++ ) {
     if( cnt[ r ] - lfreq > 8192 && depth <= 6 ) // stop recursion at given depth 2, 4, etc.
                rBuckQSort( cnt[ r ] - lfreq, newIndex + lfreq, depth + 2 );
            else if( cnt[ r ] - lfreq > 1 )
                quickSort( newIndex, lfreq, cnt[ r ] - 1, depth + 2 );
            lfreq = cnt[ r ];
        }
        memcpy( prevIndex, newIndex, freq << 2 );
        free( newIndex ), free( cnt );
    }
    
    
    #define BUCKET_QSORT( coreN, from, to ) \
        unsigned int * newIndex = ( unsigned int *) malloc( maxSeg[ coreN ] * sizeof( unsigned int ) ), \
                     * cnt = (unsigned int *) malloc( RADIX_TBL_SIZE ); \
        maxSeg[ coreN ] >>= 8; \
        for( int b = from; b < to; b += sizeof( unsigned int ) ) { \
            while( !( ( unsigned long *) &rdxSrt02[ b ] )[ 0 ] ) { if( b + 8 >= to ) break; b += 8; } \
            if( freq = ( ( ( unsigned int *) &rdxSrt02[ b ] )[ 0 ] ) ) { \
                freq -= lstFreq; \
                if( freq > 1 ) { \
                    if( freq > maxSeg[ coreN ] ) { \
                        rBuckQSort( freq, bq + lstFreq, 2 ); \
                    } else quickSort( bq, lstFreq, lstFreq + freq - 1 ); \
                } \
                lstFreq = freq + lstFreq; \
            } \
        } \
        free( newIndex ), free( cnt ); \
    Input data buffer should be increase as well, because the recursive function does not check for end of buffer
    Code:
        for (long l = 0; l <= 16; l++) { // trailingcharacters for beter alignment
            data[ size + l ] = data[ l ];
        }
    Last edited by quadro; 8th June 2015 at 19:43.

  22. #51
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    It should now determine number of cores / processes automatically.

    I've also noticed that the last part, i.e. transforming indexes to BWT's last column take way too much time, so I change them now directly when they are ready, instead of copying same indexes recursively over and over again. It now outputs last BWT column instead. It's not yet ideal, because every output character is stored as an integer, because of parallel operations.

    But now it takes 61s on enwik8. I am working on a low memory req. version, that might find some better use there.

    Code:
    #include <stdlib.h>
    #include "omp.h"
    
    #define RADIX_TBL_SIZE 256 * 256 * 4
    
    unsigned int * bq; // main index buffer and output buffer
    unsigned int bwtIndex; // resulting index
    
    unsigned char * data, * bQlfc; // data - input data buffer; bQlfc - indexes
    unsigned int fileSize // input size
    
    int cmpBranch( unsigned int b1, unsigned int b2, unsigned int depth = 2 ) {
        b1 += depth, b2 += depth; // skip (2 or 4 byte) radix sort processed data
        while( ( ( unsigned int *) &data[ b1 ] )[ 0 ] == 
        ( (unsigned int *) &data[ b2 ] )[ 0 ] ) { 
            b1 += 4, b2 += 4;
            if( b1 >= fileSize + 3 ) b1 -= fileSize;
            if( b2 >= fileSize + 3 ) b2 -= fileSize;
        }
        unsigned int p1 = ( ( unsigned int *) &data[ b1 ] )[ 0 ], 
                     p2 = ( ( unsigned int *) &data[ b2 ] )[ 0 ];
        return __bswap_32( p1 ) < __bswap_32( p2 ); // __constant_cpu_to_be32 endian conversion
    }
    
    void insertionSort( unsigned int * idx, unsigned int l, unsigned int p, 
                                                            unsigned int depth = 2 ) {
        unsigned int j, swp;
        for( unsigned int i = l + 1; i <= p; i++ )
        {
            j = i;
            while( j > l && cmpBranch( idx[ j ], idx[ j - 1 ], depth ) ) {
                swp = idx[ j - 1 ]; // top-down addition probably faster
                idx[ j - 1 ] = idx[ j ];
                idx[ j ] = swp;
                j--;
            }
        }
    }
    
    void quickSort( unsigned int * idx, unsigned int l, unsigned int p, unsigned int de = 2 ) {
        if( p - l <= 7 ) { insertionSort( idx, l, p, de ); return; } // pass the sorted depth through
        unsigned int lf = l, ri = p, mid = idx[ ( l + p ) >> 1 ], swp;
        do {
            while( ( idx[ lf ] != mid ) &&
                ( 1 == cmpBranch( idx[ lf ], mid, de /*depth of already sorted data*/ ) ) ) lf++;
            while( ( idx[ ri ] != mid ) &&
                ( 0 == cmpBranch( idx[ ri ], mid, de ) ) ) ri--; // { if( p - l == 1 ) return; ri--; }
            if( (lf <= ri) ) { swp = idx[ lf ];
                idx[ lf ] = idx[ ri ];
                idx[ ri ] = swp; lf += 1, ri -= 1; }
        } while( lf < ri );
        if( l < ri ) quickSort( idx, l, ri, de );
        if( lf < p ) quickSort( idx, lf, p, de );
    }
    
    void rBuckQSort( unsigned int freq, unsigned int * prevIndex, unsigned int depth = 4, unsigned int abs = 0 ) {
    
        unsigned int * newIndex = (unsigned int *) malloc( freq * sizeof( unsigned int ) ),
                     * cnt = (unsigned int *) malloc( RADIX_TBL_SIZE );
    
        memset( cnt, 0, RADIX_TBL_SIZE );
        for ( int r = 0; r < freq; r++ )
            cnt[ __bswap_16( ( (unsigned short *) &data[ prevIndex[ r ] + depth ])[ 0 ] ) ]++;
        int ttlfreq = 0, lfreq;
        for ( int r = 0; r < RADIX_TBL_SIZE / sizeof( unsigned int ); r++ )
            lfreq = cnt[ r ], cnt[ r ] = ttlfreq, ttlfreq += lfreq;
        for ( int r = 0; r < freq; r++ )
            newIndex[ cnt[ __bswap_16( ( ( unsigned short *)
            &data[ prevIndex[ r ] + depth ] )[ 0 ] ) ]++ ] = prevIndex[ r ];
        lfreq = 0;
        unsigned int nfreq;
        for ( int r = 0; r < RADIX_TBL_SIZE / sizeof( unsigned int ); r++ ) {
            if( nfreq = cnt[ r ] - lfreq ) {
                if( nfreq > 8192 && depth <= 6 ) // stop recursion at given depth 2, 4, etc.
                    rBuckQSort( nfreq, newIndex + lfreq, depth + 2, abs + lfreq );
                else {
                    if( nfreq > 1 )
                        quickSort( newIndex, lfreq, cnt[ r ] - 1, depth + 2 );
                    for( int j = lfreq; j < cnt[ r ]; j++ ) {
                        if( newIndex[ j ] ) bq[ abs + j ] = data[ newIndex[ j ] - 1 ];
                        else bwtIndex = abs + j, bq[ abs + j ] = data[ fileSize - 1 ];
                    }
                }
            }
            lfreq = cnt[ r ];
        }
        free( newIndex ), free( cnt );
    }
    
    #define BUCKET_QSORT( coreN, from, to ) \
        maxSeg[ coreN ] >>= 8; \
        for( int b = from; b < to; b += sizeof( unsigned int ) ) { \
            while( !( ( unsigned long *) &rdxSrt02[ b ] )[ 0 ] ) { if( b + 8 >= to ) break; b += 8; } \
            if( freq = ( ( ( unsigned int *) &rdxSrt02[ b ] )[ 0 ] ) ) { \
                freq -= lstFreq; \
                if( freq > ( maxSeg[ coreN ] ) ) { \
                    rBuckQSort( freq, bq + lstFreq, 2, lstFreq ); \
                } else { \
                    if( freq > 1 ) \
                        quickSort( bq, lstFreq, lstFreq + freq - 1 ); \
                    for( int r = 0; r < freq; r++ ) { \
                        if( bq[ lstFreq + r ] ) bq[ lstFreq + r ] = data[ bq[ lstFreq + r ] - 1 ]; \
                        else bwtIndex = lstFreq + r, bq[ lstFreq + r ] = data[ fileSize - 1 ]; \
                    } \
                } \
                lstFreq = freq + lstFreq; \
            } \
        } \
    
        for( long l = 0; l <= 6; l++ ) {  // trailing characters for better alignment
            data[ fileSize + l ] = data[ l ]; // (for BWT data buffer)
        }
    
        unsigned char * rdxSrt02 = ( unsigned char *) malloc( RADIX_TBL_SIZE );
        unsigned int * rdxSrt = ( unsigned int *) rdxSrt02; bq = ( unsigned int *) bQlfc;
    
        memset( rdxSrt02, 0, RADIX_TBL_SIZE );
        #pragma omp parallel for
        for( unsigned int c = 0; c < fileSize; c++ ) {
            #pragma omp atomic
            rdxSrt[ __bswap_16( ( (unsigned short *) &data[ c ] )[ 0 ] ) ]++;
        }
    
        // prepare parallel section boundaries and statistics for the 2-byte radix sort
        unsigned int freq, totalFreq = 0, procs = omp_get_num_procs(), parSec[ procs + 1 ][ 2 ], maxSeg[ procs ];
        memset( parSec, -1, ( procs + 1 ) * 2 * sizeof( unsigned int ) );
        memset( maxSeg, 0, procs * sizeof( unsigned int ) );
        for(unsigned int j = 0; j < RADIX_TBL_SIZE; j += 4) {
            while( !( (unsigned long *) &rdxSrt02[ j ] )[ 0 ] ) { if( j + 8 >= RADIX_TBL_SIZE ) break; j += 8; }
            if( freq = ( (unsigned int *) &rdxSrt02[ j ] )[ 0 ] ) {
                ( (unsigned int *) &rdxSrt02[ j ] )[ 0 ] = totalFreq;
                for( int k = procs - 1; k > 0; k-- )
                    if( totalFreq > ( ( fileSize / procs ) * k ) ) {
                        if( totalFreq < parSec[ k ][ 1 ] ) parSec[ k ][ 1 ] = totalFreq, parSec[ k ][ 0 ] = j;
                        if( freq > maxSeg[ k ] ) maxSeg[ k ] = freq;
                    }
                if( totalFreq <= ( fileSize / procs ) && freq > maxSeg[ 0 ] ) maxSeg[ 0 ] = freq;
                totalFreq += freq;
            }
        }
    
        // failed to split the symbol distribution into sections for parallel sorting? => 1 thread, not optimal
        parSec[ procs ][ 0 ] = RADIX_TBL_SIZE, parSec[ 0 ][ 0 ] = parSec[ 0 ][ 1 ] = 0;
        if( procs > 1 ) {
            if( parSec[ procs - 1 ][ 1 ] == -1 )
                parSec[ procs - 1 ][ 1 ] = fileSize, parSec[ procs - 1 ][ 0 ] = RADIX_TBL_SIZE;
            for( int s = 1; s <= procs - 1; s++ ) // for all sections
                if( parSec[ s ][ 1 ] == -1 || parSec[ 1 ][ 0 ] == RADIX_TBL_SIZE )
                    parSec[ s ][ 1 ] = parSec[ procs - 1 ][ 1 ], parSec[ s ][ 0 ] = RADIX_TBL_SIZE;
        }
    
        // place the radix-sorted suffix indexes at their bucket positions in the output buffer
        for( unsigned int t = 0; t < fileSize; t++ ) {
            bq[ rdxSrt[ __bswap_16( ( ( unsigned short *) &data[ t ] )[ 0 ] ) ]++ ] = t;
        }
    
        // quick sorts running in parallel on sections of unsorted data after 2-byte radix sort
        #pragma omp parallel
        {
            for( int r = 0; r < procs; r++ )
                #pragma omp sections nowait
                {
                    #pragma omp section
                    {
                        unsigned int lstFreq = parSec[ r ][ 1 ], freq;
                        BUCKET_QSORT( r, parSec[ r ][ 0 ], parSec[ r + 1 ][ 0 ] )
                    }
                }
        }
    
    
    Inverse transform (forward / backward):
    
    
            unsigned char * idxs = ( unsigned char *) malloc( fileSize << 2 );
            unsigned int * id = ( unsigned int *) idxs;
            if( idxs == NULL ) cout << "\n Not enough free memory", exit( 1 );
    
    
            //for each symbol to be decoded, sf and sf2 - symbol freq, relative or cumulative (already stored in archive)
            id[ OutSize ] = sf[ decChar ] - sf2[ decChar ]; // initialise inverse BWT table, backward transform
            //bQlfc[ sf[ decChar ] ] = decChar, id[ sf[ decChar ] ] = OutSize++; // initialise inverse BWT, forward transform
    
    
            // inverse BWT backward
            for( unsigned int c = 0; c < OutSize; c++ ) {
                decode[ OutSize - c - 1 ] = bQlfc[ bwtIndex ]; // print decoded data / characters
                bwtIndex = sf2[ bQlfc[ bwtIndex ] ] + ( id[ bwtIndex ] /*& 0x00FFFFFF*/ ); // next character's index
            }
            //    inverse BWT forward
            //for( unsigned int c = 0; c < OutSize; c++ )  // print decoded data / character
            //    decode[ c ] = bQlfc[ bwtIndex ], bwtIndex = id[ bwtIndex ];
    Last edited by quadro; 12th June 2015 at 15:35.

  23. #52
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    238
    Thanks
    188
    Thanked 17 Times in 12 Posts
    Thanks - when will there be an updated version? I'm using XP 32-bit on a 1.4 GHz single-core laptop with only 1.5 GB memory, compressing XML text files of approx. 5 MB. What would be the best option?

    John

  24. #53
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Thanks, hopefully soon, some time this month.

    For a single core I'd choose divsufsort; alternatively, a radix sort may help to split the file into smaller chunks for later sorting with, e.g., a comparison sort. But I think a 5 MB file is not such a problem for divsufsort, which needs 5n memory. If you'd like to save some more memory, you may try splitting the file into smaller parts and then using e.g. merge sort or its variants (in the case of divsufsort or a least-significant-digit radix sort).

    I don't know how the MSD radix sort above would perform with just one core. It should not always require a merge sort when the file gets split, but it will probably get much slower overall with only 1 thread.

    If you decide to compress some bigger data, as a workaround for memory use I use a fast flash disk with a swap partition on it; it's not the fastest way, as there is a lot of overhead to it. I think WinXP also has the ability to use a swap file, though with a size limit. Anyway, with only 5 MB files you'll probably not need this.

  25. #54
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    I found that my code probably has problems on larger files. And unfortunately I have to postpone development because of holidays right now.

    It seems the backward transformation can also run faster in 5n space. (Implementing an idea from MergeTL,
    i.e. joining the last BWT column and the rank / index buffer together for better sequential access.)

    It seems there is an even faster algorithm, but greedier on memory. Maybe it could be combined with other algorithms too.
    http://www.cs.helsinki.fi/u/tpkarkka/publications/ccp2011.pdf

    https://helda.helsinki.fi/bitstream/...pdf?sequence=1
    people.inf.elte.hu/lukovszki/Publ/esa05.pdf
    Last edited by quadro; 22nd June 2015 at 19:28.

  26. Thanks:

    avitar (22nd June 2015)

  27. #55
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Alright, I took a bigger pause now; here is a new version, not so aimed at speed. There should also be a small improvement in compression.

    It uses the combination of radix, bucket, quick and insertion sort above, instead of libdivsufsort, reducing the memory requirement of sorting to below 1.5N. Reconstruction still uses ~4N+, but this should get better in the future.

    github.com/mike-c/qad
    Attached Files

  28. Thanks (3):

    Bulat Ziganshin (17th December 2016),encode (22nd December 2016),Stephan Busch (19th December 2016)

  29. #56
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    876
    Thanks
    474
    Thanked 175 Times in 85 Posts
    Quote Originally Posted by quadro View Post
    Alright, I took a bigger pause now; here is a new version, not so aimed at speed. There should also be a small improvement in compression.

    It uses the combination of radix, bucket, quick and insertion sort above, instead of libdivsufsort, reducing the memory requirement of sorting to below 1.5N. Reconstruction still uses ~4N+, but this should get better in the future.

    github.com/mike-c/qad
    I have tested the 64-bit build so far, and it crashes with f and f2. The test set was app.tar from squeezechart.

