
Thread: Multi-threaded compression

  1. #1 - Cyan (Member)

    Multi-threaded compression

    Since multi-threading seems to be all the rage by now, I'm willing to give it a try, probably with one of my oldest and simplest compressors. Maybe LZP2, maybe Zhuff, I don't know yet.

    The question is: what would be the best interface to work with on that topic?

    I'm simply puzzled by the fact that the multi-threading interface seems OS-specific. The Windows "handle" object has no equivalent in the Unix world. So I was wondering if it would be better to use some kind of layer providing abstraction from the OS.

    Looking around, I found two libraries which could help in making the code multi-OS:
    1) Boost (C++)
    2) LZMA2 (C), which also seems to be used in FreeArc

    Since some of you have already dug into this matter, what would be your own recommendation towards multi-threading?

    Regards

  2. #2 - Gribok (Programmer)
    Quote Originally Posted by Cyan View Post
    Since it seems that multi-threading is all the rage by now, i'm willing to give it a try [...] what would be your own recommendation towards multi-threading ?
    + Threading Building Blocks (TBB)
    + OpenMP
    Enjoy coding, enjoy life!

  3. #3 - toffer (Programmer)
    You can try pthreads - it's tiny and not bloated compared to the bigger frameworks - and there's a win32 port.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  4. #4 - Black_Fox (Tester)
    Or TinyThread++; a less useful tool for this case would be ClanLib.

    @toffer: is the win32 pthreads port any good?
    I am... Black_Fox... my discontinued benchmark
    "No one involved in computers would ever say that a certain amount of memory is enough for all time. I keep bumping into that silly quotation attributed to me that says 640K of memory is enough. There's never a citation; the quotation just floats like a rumor, repeated again and again." -- Bill Gates

  5. Thanks: Bulat Ziganshin (29th November 2013)

  6. #5 - toffer (Programmer)
    My requirement is to get fast code. Thus I'd rather stick with C-style code which doesn't use things like RTTI, exceptions, etc. (a very good example is C++ iostreams vs. FILE I/O). These add overhead. As a bonus you get portable code. When I want to make the code pretty, I typically encapsulate the calls in some simple classes; that's not much work.

    http://sourceware.org/pthreads-win32/
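    For illustration, the thin encapsulation described above might look like this minimal sketch (class names are made up here, assuming pthreads is available):

```cpp
#include <pthread.h>
#include <cassert>

// Thin RAII wrappers over pthread mutex calls: no RTTI, no exceptions,
// no overhead beyond the underlying calls themselves.
class Mutex {
  pthread_mutex_t m;
public:
  Mutex()  { pthread_mutex_init(&m, 0); }
  ~Mutex() { pthread_mutex_destroy(&m); }
  void lock()   { pthread_mutex_lock(&m); }
  void unlock() { pthread_mutex_unlock(&m); }
};

// Scoped lock: acquired in the constructor, released in the destructor,
// so early returns can't leave the mutex held.
class ScopedLock {
  Mutex& mx;
public:
  explicit ScopedLock(Mutex& m) : mx(m) { mx.lock(); }
  ~ScopedLock() { mx.unlock(); }
};
```

    The same pattern extends to condition variables and thread handles; the wrappers compile down to the raw pthread calls.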

  7. #6 - Cyan (Member)
    Thanks for the advice.

    pthreads is also recommended by Shelwien, so it seems a good target, well known in the *nix world. I'm slightly worried by the size of the win32 port's source, though.

    TinyThread++ is also a very good surprise; it looks like a small wrapper over pthreads & win32 handles. Its objective (mimicking the C++0x interface) seems fairly relevant too. The only (very) minor downside is that it is C++.

    I agree with the requirement of fast code; this is an objective.
    On the other hand, for a first implementation, I'm also looking for a simple interface.
    In this respect, only Boost has seemed to provide something easily understandable up to now. But I'll look again, and maybe pthreads & TinyThread++ will be competitive enough for this objective too.

    Regards

  8. #7 - Bulat Ziganshin (Programmer)
    I think that anything more complex than OpenMP will be overkill for your task (quick conversion into multithreaded code).

  9. #8 - Shelwien (Administrator)
    Btw, do you simply want to compress blocks in parallel, or maybe you have some ideas about how to split LZ into async parts?
    Either way, actual threading would probably be the least of the problems - first you'd have to change your algorithm (and likely the archive format too)
    to let it do some processing out of order, and that doesn't really require any physical parallelism.

  10. #9 - Matt Mahoney (Expert)
    Thanks for the advice. pthreads-win32 looks nice because I can use the same code for Windows and Linux. I wrote a simple test program in Windows that runs a fixed maximum number of threads at one time by having each thread check a mutex-protected counter before starting and wait on a condition variable until it is below the maximum. That's the application I need for a parallel version of ZPAQ. My plan is to compress or decompress each block in a separate thread but limit the number running at one time to prevent running out of memory. It doesn't need much context switching or communication, so this should work perfectly. Well, almost. I'd rather not have an extra .dll file, but there are some complications with static linking, it seems.

  11. #10 - Shelwien (Administrator)
    Not sure if you care, but it's possible to integrate the pthreads dll into the .exe using my dllmerge utility -
    I tested it with toffer's m1 before and it worked.
    Also consider making a wrapper for the CreateThread winapi (for Windows) and pthread_create (for all) -
    having that and the "volatile" keyword is enough for simple multithreaded apps.
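    A sketch of such a wrapper (the names are made up; the Windows branch follows the usual CreateThread trampoline pattern and is the part that would need testing there):

```cpp
#include <cassert>

// One signature over CreateThread (Windows) and pthread_create (everything else).
typedef void* (*thread_fn)(void*);

#ifdef _WIN32
#include <windows.h>
struct Thread { HANDLE h; thread_fn f; void* arg; };
static DWORD WINAPI trampoline(LPVOID p) {   // adapts __stdcall to thread_fn
  Thread* t = (Thread*)p;
  t->f(t->arg);
  return 0;
}
int thread_start(Thread* t, thread_fn f, void* arg) {
  t->f = f; t->arg = arg;
  t->h = CreateThread(0, 0, trampoline, t, 0, 0);
  return t->h ? 0 : -1;                      // 0 on success, like pthread_create
}
void thread_join(Thread* t) {
  WaitForSingleObject(t->h, INFINITE);
  CloseHandle(t->h);
}
#else
#include <pthread.h>
struct Thread { pthread_t t; };
int thread_start(Thread* t, thread_fn f, void* arg) {
  return pthread_create(&t->t, 0, f, arg);
}
void thread_join(Thread* t) {
  pthread_join(t->t, 0);
}
#endif

// Example worker: increments the int it is handed.
static void* add_one(void* p) { ++*(int*)p; return 0; }
```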

  12. #11 - Shelwien (Administrator)
    Here's a rangecoder demo with thread-based async i/o:
    http://nishi.dreamhosters.com/u/rc_v6.rar
    It uses some minimalistic winapi wrapper (see thread.inc)

  13. #12 - osmanturan (Programmer)
    Thanks for rc_v6 Shelwien!

    As for me, writing wrappers is the way to go, because threads are not so complicated under any OS. In multithreaded apps, the main problem is how to parallelize the algorithm. Even if you're thinking about simple block-wise processing, you have to think a bit more than for serial processing.
    BIT Archiver homepage: www.osmanturan.com

  14. #13 - Matt Mahoney (Expert)
    Dumb question about dllmerge. How do I get the base and offset for rebase?

    Anyway, I want to make a parallel version of zpaq (pzpaq). The idea is to compress or decompress different blocks or files in different threads since they are independent. But there is a tradeoff: dividing the file into independent blocks hurts compression because it loses context, and memory must also be shared between threads, so each gets less. The idea is that the user specifies the number of threads allowed to run at one time (usually no more than the number of cores), and a thread is also not allowed to start if it would exceed a memory limit. This can be done by splitting the input into blocks no larger than 1/n (for n threads) if possible, and working on them from largest to smallest. (Not perfect, but perfect scheduling is an NP-complete problem.) The main thread would decide when to start each worker thread. The worker would signal the main thread when it was done, while main waited for signals from active threads before starting the next one. After all the threads finished, main would pack all the independently compressed blocks into an archive or file.

    Here is how it would be done using pthreads. For this simple test, there are 10 worker threads but only 3 allowed to run at one time. Each thread runs for a variable time from 0 to 10 seconds. I have only tested it in Windows, but it ought to work in Linux as-is because it is all POSIX based.

    Code:
    /* tlim2.cpp - Test program to limit the number of threads
       running at one time
    
    To compile: g++ tlim2.cpp -lpthreadGC2
    You need these files from pthreads-win32 from
    http://sourceware.org/pthreads-win32/
      
      libpthreadGC2.a   in C:\mingw\lib or -L
      pthread.h         in C:\mingw\include or -I
      sched.h
      semaphore.h
      pthreadGC2.dll    in PATH (to run)
    
    pthreads tutorial: https://computing.llnl.gov/tutorials/pthreads/
    */
    
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    int count=0;  // number of threads running
    pthread_cond_t cv=PTHREAD_COND_INITIALIZER;  // to signal end of thread
    pthread_mutex_t mutex=PTHREAD_MUTEX_INITIALIZER; // protects count, cv
    
    // Call f and check that the return code is 0
    #define check(f) { \
      int rc=f; \
      if (rc) fprintf(stderr, "Line %d: %s: error %d\n", __LINE__, #f, rc); \
    }
    
    // burn CPU for t seconds
    void work(int job, int t) {
      printf("Starting job %d for %d seconds\n", job, t);
      clock_t start=clock();
      while (clock()-start<CLOCKS_PER_SEC*t) ;
      printf("Finished job %d in %d seconds\n", job, t);
    }
    
    // Worker thread
    void *thread(void *arg) {
    
      // Convert argument to thread number
      int n=(int)(ptrdiff_t)arg;
    
      // Work for some random time
      work(n, n*3%10);
    
      // Let controlling thread know we're done
      check(pthread_mutex_lock(&mutex));
      --count;
      check(pthread_cond_signal(&cv));
      check(pthread_mutex_unlock(&mutex));
    
      // Return NULL status
      return 0;  // or pthread_exit(0);
    }
    
    // Test thread scheduler
    int main() {
      const int N=10;  // number of threads
      const int limit=3;  // max count allowed
    
      // Start timer for total job
      clock_t start=clock();
    
      // Start N joinable threads
      pthread_t tid[N];
      pthread_attr_t attr;
      check(pthread_attr_init(&attr));
      check(pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE));
      for (int i=0; i<N; ++i) {
    
        // Wait until count < limit before starting next thread
        check(pthread_mutex_lock(&mutex));
        while (count>=limit) {
          check(pthread_cond_wait(&cv, &mutex));
        }
        ++count;
        check(pthread_mutex_unlock(&mutex));
        check(pthread_create(tid+i, &attr, thread, (void*)(ptrdiff_t)i));
      }
    
      // Wait until all threads finish
      void *status;  // returned by thread, not used
      for (int i=0; i<N; ++i) {  // The order doesn't matter
        check(pthread_join(tid[i], &status));
        printf("Joined thread %d\n", i);
      }
    
      // Report total time for job
      printf("Time %1.2f sec\n", double(clock()-start)/CLOCKS_PER_SEC);
      return 0;
    }
    Output looks like this when running on 2 cores. Task Manager shows 100% CPU most of the time so both cores are working.

    Code:
    C:\res\pthread>timer tlim2
    
    Timer 3.01  Copyright (c) 2002-2003 Igor Pavlov  2003-07-10
    Starting job 0 for 0 seconds
    Starting job 1 for 3 seconds
    Starting job 2 for 6 seconds
    Finished job 0 in 0 seconds
    Starting job 3 for 9 seconds
    Finished job 1 in 3 seconds
    Starting job 4 for 2 seconds
    Finished job 4 in 2 seconds
    Starting job 5 for 5 seconds
    Finished job 2 in 6 seconds
    Starting job 6 for 8 seconds
    Finished job 3 in 9 seconds
    Starting job 7 for 1 seconds
    Finished job 5 in 5 seconds
    Starting job 8 for 4 seconds
    Finished job 7 in 1 seconds
    Joined thread 0
    Joined thread 1
    Joined thread 2
    Joined thread 3
    Joined thread 4
    Joined thread 5
    Starting job 9 for 7 seconds
    Finished job 6 in 8 seconds
    Joined thread 6
    Joined thread 7
    Finished job 8 in 4 seconds
    Joined thread 8
    Finished job 9 in 7 seconds
    Joined thread 9
    Time 17.38 sec
    
    Kernel Time  =     0.015 = 00:00:00.015 =   0%
    User Time    =    31.481 = 00:00:31.481 = 180%
    Process Time =    31.496 = 00:00:31.496 = 181%
    Global Time  =    17.394 = 00:00:17.394 = 100%

  15. #14 - Matt Mahoney (Expert)
    Well, I tested in Ubuntu Linux. I need to add "#include <stddef.h>" or else the compiler doesn't understand "ptrdiff_t". Other than that it works, with one odd behavior: clock() runs twice as fast as it should when 2 or more threads are running. In other words, work(x, 10) runs for 5 seconds instead of 10 if another thread is running at the same time. The program still reports that the whole job takes 17 seconds but really it takes 9. Using time(0) avoids this problem. The man page says that some systems, but not Linux, count the CPU time used by child processes (with no mention of threads), so this seems to be wrong. Windows doesn't have this odd behavior.

  16. #15 - Shelwien (Administrator)
    > Dumb question about dllmerge. How do I get the base and offset for rebase?

    Well, in .exe, there's some memory allocation defined (including uninitialized
    space, so allocation size can be very different from exe size), and dllmerge
    puts dll's sections right after exe's sections, but with a quirk - there's an
    implicit PE header section, and dll's header is discarded.

    Thus, the rebase target for the dll is usually ImageBase+ImageSize-0x1000.
    I normally use other tools to look up these values, but it's also possible
    to look them up with dllmerge itself - run it like

    dllmerge.exe precomp.exe precomp.exe nul >log

    and find these lines in the log:

    uint ImageBase = 00400000
    [...]
    uint szImage = 001B1000

    Thus, the dll's new base is 0x400000+0x1B1000-0x1000=0x5B0000

  17. #16 - Shelwien (Administrator)
    > I need to add "#include <stddef.h>" or else the compiler doesn't understand "ptrdiff_t".

    I prefer to use a workaround instead - something like (((char*)p)-((char*)0)) is ptrdiff_t
    automatically, without stddef.h
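    As a tiny sketch of that trick applied to thread arguments (the function names here are made up):

```cpp
#include <cassert>

// Pass a small integer through a void* thread argument without stddef.h:
// a difference of two char pointers is ptrdiff_t by the language rules.
void* encode_arg(int n)   { return (void*)(((char*)0) + n); }
int   decode_arg(void* p) { return (int)(((char*)p) - ((char*)0)); }
```

    Strictly speaking this leans on flat-pointer behavior rather than standard C, which is typical for this kind of low-level workaround.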

    > clock() runs twice as fast as it should when 2 or more threads are running

    I also encountered systems where clock() returns time rounded to seconds, so
    I'm using this instead:

    Code:
    #ifdef __GNUC__
    
    #include <sys/types.h>
    
    #include <sys/time.h>
    int GetTickCount(void) {
      struct timeval t;
      gettimeofday( &t, 0 );
      return t.tv_sec*1000 + t.tv_usec/1000;
    }
    
    #else
    
    #ifndef _WINDOWS_
    extern "C" __declspec(dllimport) unsigned __stdcall GetTickCount( void );
    #endif
    
    #endif
    Also a very important point is that time can go backwards on multicore/SMP systems -
    it can really screw up naive timeouts and such.
    The thing is, modern kernels estimate the time by reading the cpu's TSC (with rdtsc),
    which can be out of sync on different cores and, especially, different cpus.
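    A common defense (a sketch, not part of the posted code) is to clamp the tick counter so callers never see it decrease:

```cpp
#include <cassert>

// Wrap a possibly non-monotonic millisecond clock (e.g. TSC-based time
// that differs between cores) so the reported value never goes backwards.
int MonotonicTicks(int raw_ticks) {
  static int last = 0;
  if (raw_ticks > last) last = raw_ticks;
  return last;
}
```

    In a multithreaded program `last` itself would need the usual protection; this shows the single-threaded idea only.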

  18. #17 - Shelwien (Administrator)
    > Anyway, I want to make a parallel version of zpaq (pzpaq). The idea
    > is to compress or decompress different blocks or files in different
    > threads since they are independent.

    It would be more interesting if you could add some statistical
    segmentation, like http://www.ctxmodel.net/files/PPMd/segfile_sh2.rar

    Also don't forget about async i/o - the speed impact can be significant
    sometimes, a lot of systems still have slow storage, especially notebooks
    and such. Btw it can change the whole archive format - especially things
    like out-of-order storing of compressed blocks.

    Then, another good thing about CMs is that you can unsync rangecoding
    and modelling - ie the model can store bits + probabilities, and
    the rangecoder thread would actually encode them - as an additional bonus, that
    provides a way to "transparently" disable compression of a block if it appears expanded
    (set a flag to replace all the model's predictions with 0.5 and that's it).
    Going further that way, you can also split the submodels into separate threads
    (ie a mixer thread receives two streams of input probabilities and produces a stream
    of mixed probabilities).
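    The unsynced model/coder idea can be sketched without threads first: the model appends (bit, probability) pairs to a queue, and a separate consumer (which would be the rangecoder thread) drains it; the "stored block" flag simply overrides every probability with 0.5. The names, the 12-bit probability scale, and the use of ideal code length in place of a real rangecoder are all illustrative assumptions, not anyone's actual implementation:

```cpp
#include <vector>
#include <cmath>
#include <cassert>

struct BitEvent { int bit; unsigned p; };  // p = P(bit==1) * 4096

// Model side: queue a prediction instead of coding it immediately.
void model_emit(std::vector<BitEvent>& q, int bit, unsigned p) {
  BitEvent e; e.bit = bit; e.p = p;
  q.push_back(e);
}

// Coder side: consume the queue; stored_flag replaces every prediction
// with 0.5, i.e. the block is passed through uncompressed.  Instead of
// an actual rangecoder we just total the ideal code length in bits.
double coder_consume(const std::vector<BitEvent>& q, bool stored_flag) {
  double bits = 0;
  for (size_t i = 0; i < q.size(); ++i) {
    unsigned p1 = stored_flag ? 2048u : q[i].p;
    double p = q[i].bit ? p1 / 4096.0 : 1.0 - p1 / 4096.0;
    bits += -std::log(p) / std::log(2.0);  // -log2(p)
  }
  return bits;
}
```

    With a real rangecoder, the consumer would run in its own thread and the queue would need the usual synchronization; the point of the sketch is that the coder never touches the model M.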

  19. #18 - Member (El Salvador)
    You can parallelize the prediction just fine; the mix is largely independent, you just need a memory barrier on add, or use a ring-buffer for the queue and rely on implicit memory ordering. If granularity is very fine, use exponential spin-locks for yielding the mix-subprocesses (use a worker-queue assigning the same thread each time to the same model, to take advantage of the fact that you have L1*CPUs now). You can prototype this in 20 minutes with OpenMP. For example, here you can dump out 11 issues in parallel:

    scm1.mix(m);
    scm2.mix(m);
    scm3.mix(m);
    scm4.mix(m);
    scm5.mix(m);
    scm6.mix(m);
    scm7.mix(m);
    scm8.mix(m);
    scm9.mix(m);
    scm10.mix(m);
    cm.mix(m);

    I once made an order-0 escape model which was decoupled from the arithmetic entropy-coder, and it made 1.2x with spin-locks (moving the interval-data around made more than that impossible). I did not manage to do this successfully when the model-cost was very high (eg. PPM), because Windows task-switching sucks big time and the spin killed the model (I could have put the coder on lower priority, but then other problems occur). You just have to try and not give up. It's possible to make such tight processes dance harmonically with each other; it just needs a bit of patience.
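    A minimal OpenMP prototype of that shape might look like this (the submodels are stubbed out; compiled without -fopenmp the pragma is simply ignored and the loop runs serially, so the result is the same either way):

```cpp
#include <cassert>

// Stub submodel standing in for scm1..scm10 and cm: each prediction
// depends only on the submodel's own state, so they can run in parallel.
struct SubModel {
  unsigned state;
  int predict() const { return (int)((state * 2654435761u >> 20) & 4095u); }
};

// Issue all predictions in parallel; the final mix needs every input,
// so it stays serial (this is the part that forces synchronization).
int mix_parallel(const SubModel* models, int n) {
  int p[16];
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    p[i] = models[i].predict();
  long sum = 0;                       // trivial stand-in for a real mixer
  for (int i = 0; i < n; ++i) sum += p[i];
  return (int)(sum / n);
}
```

    The averaging "mixer" is only a placeholder; the structure (parallel issue, serial combine) is the point.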

  20. #19 - osmanturan (Programmer)
    (Slightly off-topic...)
    Shelwien, I've tested both RC v5 and RC v6. What's the reason for the huge speed-up of RC v6? I've tested on my laptop (i7 740Q 1.73 GHz, 8 threads):

    RC v5 -> ~3.5 MB/s
    RC v6 -> ~10 MB/s

    I've roughly read the code, and the main differences seem to be coroutines and multi-threaded I/O. So I can't tell each part's contribution to the speed impact. Could you drop some comments about it?

  21. #20 - toffer (Programmer)
    It's neither parallel IO nor coroutines (at least those gains will be significantly smaller). It's about overlapping modeling and coding.

    The usual "model M -> (y,p) -> coding (code y with probability p)" queue is split:
    - the main model M outputs a sequence S of bits and probabilities,
    - S is sorted according to p,
    - S is assumed to be the data to be compressed.
    S is coded *without* directly knowing the actual model M, using a simple (secondary) model M'.
    Since M' only encodes/decodes values (y,p) as data elements, it can happen without any information about M; thus S can be (de)compressed without touching M at all.

    Summarizing: the main model M transforms the input data into S, and S is coded in parallel (as soon as input is available) using M'.

    Cheers
    Last edited by toffer; 10th January 2011 at 18:03.

  22. #21 - Matt Mahoney (Expert)
    Here is a preview of what I plan to write. http://mattmahoney.net/dc/pzpaq.html (not linked on my website).

    Parallel ZPAQ will have a gzip-like interface. It will be able to compress multiple files in parallel, splitting them into blocks if needed (with some compression loss). It will decompress multiple files in parallel or multiple blocks within a single file.

    Compressing and decompressing ZPAQ blocks in parallel should be simple enough to do. Blocks are designed to be independent. libzpaq should be thread safe (not tested yet). There are complications when files span multiple blocks and have to be split and concatenated using temporary files. Splitting on context boundaries is nice but it is also important to divide evenly so all threads finish about the same time. The interface allows for future advancements like this.

    It is true that some ZPAQ components like CM and ICM can run in parallel but others like mixers and the arithmetic coder depend on their outputs so they have to synchronize on every input bit. Also, they are bounded mostly by memory access so threads wouldn't help much. (Of course parallel blocks have the same problem). The arithmetic coder can be pipelined for compression but not decompression, but the gain is small because most of the computation is in modeling. Coding is fast.

    I may add decompression optimization with an external compiler for g++ like zpaq does now. It won't be necessary for compression because all of the needed models will be built in and already optimized. It won't read config files or use external preprocessors. That's how it should be. It should detect .txt, .bmp, .jpg, .exe, etc. and select the right model without the user having to do anything. If you want to develop new compression algorithms, then that is what zpaq is for.

  23. #22 - Shelwien (Administrator)
    @osmanturan:
    > I've rougly read the code and main differences seem coroutines and
    > multi-threading I/O. So, I can't decide each parts' contribution to
    > speed impact. Could you drop some comments about it?

    Yeah, but also v6 is compiled with IntelC, and the .exe in v5 with MSC
    (and size opt instead of speed); v5 works via getc/putc; there's a 64-bit
    multiplication in v5's rangecoder, etc.
    So I'd suggest comparing v6 to http://nishi.dreamhosters.com/u/o01_v0.rar instead.

    @toffer:
    No, rc_v6 is not psrc, just a plain coder with threads for reading and writing data.

  24. #23 - osmanturan (Programmer)
    @Toffer: Thanks for the detailed explanation. While reading your comment, I was thinking to myself "the code seems to be just o0-o1 static mixing; where were these parts?" Luckily Shelwien's comments appeared at the bottom of the page. Thanks anyway.

    @Shelwien: Ok, I'll give it a try. But currently all 8 threads are being burned by Fryrender, and it will take until morning, I guess. So I can only test them tomorrow.

  25. #24 - Matt Mahoney (Expert)
    Update: pzpaq is working. Just compression for now, and most options don't work yet. But I do have results for the Calgary corpus compressed to 14 separate files, max compression, on a dual core T3200 under 64 bit Ubuntu. As you can see, the speedup is not linear: about 1.6 for 2 cores (probably due to memory contention).

    Code:
    matt@matt-M-7301U:~/zpaq/calgary$ time ../pzpaq -3t 1 *
    
    real	0m24.047s
    user	0m20.290s
    sys	0m3.700s
    matt@matt-M-7301U:~/zpaq/calgary$ rm *.zpaq
    matt@matt-M-7301U:~/zpaq/calgary$ time ../pzpaq -3t 2 *
    
    real	0m14.946s
    user	0m25.220s
    sys	0m3.420s
    matt@matt-M-7301U:~/zpaq/calgary$ rm *.zpaq
    matt@matt-M-7301U:~/zpaq/calgary$ time ../pzpaq -3t 3 *
    
    real	0m15.421s
    user	0m25.430s
    sys	0m4.220s
    matt@matt-M-7301U:~/zpaq/calgary$ cat *.zpaq >x.zpaq
    matt@matt-M-7301U:~/zpaq/calgary$ ll *
    -rw-r--r-- 1 matt matt 111261 2004-12-17 10:09 BIB
    -rw-r--r-- 1 matt matt  22998 2011-01-11 00:06 BIB.zpaq
    -rw-r--r-- 1 matt matt 768771 2004-12-17 10:09 BOOK1
    -rw-r--r-- 1 matt matt 199465 2011-01-11 00:06 BOOK1.zpaq
    -rw-r--r-- 1 matt matt 610856 2004-12-17 10:09 BOOK2
    -rw-r--r-- 1 matt matt 128357 2011-01-11 00:06 BOOK2.zpaq
    -rw-r--r-- 1 matt matt 102400 2004-12-17 10:09 GEO
    -rw-r--r-- 1 matt matt  46585 2011-01-11 00:06 GEO.zpaq
    -rw-r--r-- 1 matt matt 377109 2004-12-17 10:09 NEWS
    -rw-r--r-- 1 matt matt  96352 2011-01-11 00:06 NEWS.zpaq
    -rw-r--r-- 1 matt matt  21504 2004-12-17 10:09 OBJ1
    -rw-r--r-- 1 matt matt   8925 2011-01-11 00:06 OBJ1.zpaq
    -rw-r--r-- 1 matt matt 246814 2004-12-17 10:09 OBJ2
    -rw-r--r-- 1 matt matt  56169 2011-01-11 00:06 OBJ2.zpaq
    -rw-r--r-- 1 matt matt  53161 2004-12-17 10:09 PAPER1
    -rw-r--r-- 1 matt matt  13931 2011-01-11 00:06 PAPER1.zpaq
    -rw-r--r-- 1 matt matt  82199 2004-12-17 10:09 PAPER2
    -rw-r--r-- 1 matt matt  21349 2011-01-11 00:06 PAPER2.zpaq
    -rw-r--r-- 1 matt matt 513216 2004-12-17 10:09 PIC
    -rw-r--r-- 1 matt matt  28232 2011-01-11 00:06 PIC.zpaq
    -rw-r--r-- 1 matt matt  39611 2004-12-17 10:09 PROGC
    -rw-r--r-- 1 matt matt  10402 2011-01-11 00:06 PROGC.zpaq
    -rw-r--r-- 1 matt matt  71646 2004-12-17 10:09 PROGL
    -rw-r--r-- 1 matt matt  12070 2011-01-11 00:07 PROGL.zpaq
    -rw-r--r-- 1 matt matt  49379 2004-12-17 10:09 PROGP
    -rw-r--r-- 1 matt matt   8400 2011-01-11 00:07 PROGP.zpaq
    -rw-r--r-- 1 matt matt  93695 2004-12-17 10:09 TRANS
    -rw-r--r-- 1 matt matt  13277 2011-01-11 00:07 TRANS.zpaq
    -rw-r--r-- 1 matt matt 666512 2011-01-11 00:07 x.zpaq
    I have to remove the *.zpaq files manually because the program doesn't do it yet while I'm developing.

  26. #25 - toffer (Programmer)
    Sorry, I didn't check the code. That makes me look like an idiot - my bad.

    @Shelwien:
    Last time we talked about that, the speed gain was less notable?! Did anything change?

  27. #26 - osmanturan (Programmer)
    Quote Originally Posted by toffer View Post
    Sorry, i didn't check the code. That makes me look like an idiot - my bad.
    No problem. As a side effect, I've learned the details of another of Shelwien's ideas. Thanks.

  28. #27 - Shelwien (Administrator)
    > Last time we talked about that the speed gain was less notable?! Did anything change?

    Well, rc_v6 is a stripped-down psrc, so there's no secondary model there, but the primary
    model is the same. Stats are as follows, though these are plain builds - no PGO etc.

    Code:
    c.time d.time enwik8    c.time d.time enwik8bwt
    6.562s 6.844s 52210420  6.016s 6.578s 21337208 // o01_v0
    6.219s 7.625s 50774396  5.141s 6.406s 21265108 // rc_v6
    9.828s 6.500s 50416578  6.453s 4.719s 21133640 // psrc_04i3
    > As a side effect, I've learned details about another Shelwien's idea

    I guess I didn't explain it frequently enough

    Also, unfortunately there's no 2x speed gain in psrc either
    (the secondary model is slower than the primary, so it's not well balanced),
    but I think it's good enough as a proof-of-concept... I would continue
    after getting some real codec as a frontend.

  29. #28 - Cyan (Member)

    7zip and multi-threading

    Shelwien
    Btw, do you simply want to compress blocks in parallel, or maybe you have some ideas about how to split LZ into async parts?
    That's a good question; I was keeping it for a later post, but since you asked...

    I'm interested in digging into this issue.
    Obviously, splitting the input file into blocks looks like the easiest way to achieve compression parallelism, but it comes with its own problems; namely, we lose the ability to find and use dependencies between blocks.
    I guess this is no problem for block-based algorithms, such as the excellent BSC, but for LZ (and probably others) there is definitely a loss of efficiency.
    Nonetheless, I will probably investigate this method first, since it is the easiest one.
    Hopefully, for low-depth LZ such as Zip, LZ4 or Zhuff, the loss can be made negligible with large enough blocks. But for large-distance matchers such as Etincelle, this is a real issue.

    [Edit] made a small post here on block-based parallelism

    A second way to achieve parallelism is to break the algorithm into chained tasks. This is more complex, since it affects the algorithm itself, and hence requires deeper rewriting.
    And that's not all: perfect parallelism is likely unrealizable, or very difficult to achieve.

    As an example, I could break Zhuff into 2 parts: the LZ matcher and the entropy encoder (same for Etincelle, by the way). While one core works on LZ, the other one can work on entropy. But is that an equal division of work? Well, most certainly not...
    Moreover, this may help on 2-core systems, but what about 3 or 4 or more?
    This is really much less generic.

    Anyway, any suggestion is welcome, since this is a very interesting issue to discuss.

    Btw, I'm looking for information on how 7zip achieves its multi-threading capability. I know the code is available, but really, that doesn't help much.
    I made a small test with 7zip, compressing enwik8 with one and then two threads.
    The resulting files are nearly identical (24,861,333 vs 24,861,205 bytes), and memory usage is almost identical too. And CPU usage is 'only' ~70% on a 2-core system.
    So I guess this is something like the second "division of work" strategy.
    Last edited by Cyan; 14th January 2011 at 02:36.

  30. #29
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,329
    Thanks
    209
    Thanked 1,003 Times in 528 Posts
    > I guess this is no problem for block-based algorithm, such as
    > excellent BSC, but for LZ (and probably others) there is definitely
    > a loss of efficiency.

    Maybe it seems obvious for block-based BWT to use block-based parallelism,
    but it's actually not the case with bsc.
    In fact it's more like a good example of your "chained tasks" - even at
    decoding, there're qlfc, iBWT, unLZP (and probably more).
    Also qlfc consists of entropy coding and MTF, which in theory can be
    also separated (especially for encoding), and BWT/iBWT can be threaded too.
    Anyway, I just tested bsc 2.4.5 and couldn't force it to produce a larger
    archive by disabling the threading.

    > But for large-distance matchers such as Etincelle, this is a real issue.

    Note that you still can use the "history" of specific thread, ie
    if thread0 processed blocks 0 and 4, you can reference block0 data
    from block4, and also "older" blocks from all threads, ie
    when thread0 would process block8 (after 0 and 4), thread1 would likely
    have already finished block1, so thread0 could reference it too.

    Or, alternatively, you can just process different files in threads,
    but i/o likely would be a bottleneck.

    > As an example, i could break Zhuff into 2 parts : the LZ matcher,
    > and the entropy encoder (same for Etincelle by the way).

    It's actually more than you think. The rangecoder itself can work in 2+
    threads (like psrc which toffer described here; or even plain rc for encoding only).
    Then there's parsing optimization. And then it makes sense to add filters -
    the usual preprocessors and something like Bulat's rep.
    Filters + Data indexing (match finder) + parsing optimization + primary model +
    secondary model - that's already 5 threads, plus you'd need threads for
    reading and writing...
    So unless you're aiming for GPU, blockwise parallelism seems unnecessary
    for now.

    > While one core works on LZ, the other one can work on entropy. But
    > is that an equal division of work ? Well, most certainly not ...

    Actually I think that such kind of load-balancing is overrated.
    Well, sure it's cool if you can split your algo into 1000 threads,
    then it really makes sense to add a smart scheduler which would
    use up all the available computing resources.

    But we don't really have that much competition for multithreading
    compression algos at this point, so if you can speed up your
    processing by using other cores, that's already nice even if it
    can't fully load all the cores.

    Also hogging all the cpu resources is actually bad - you should leave
    enough for GUI at least. And imagine that you manually scheduled
    the tasks to fully load 4 cores, and then one core becomes busy
    with something else - in the worst case it can be slower than no
    multithreading at all, if the other threads have to wait for the thread
    on the loaded core to catch up.

    > Moreover, this may help for 2 cores systems, but what about 3 or 4 or more ?
    > This is really much less generic.

    In LZ, there's probably enough work even for 6-8 cores, also don't forget that
    cores still share the memory bandwidth.
    Actually even a GPU version might make sense, if we'd switch to LZ78
    (ie building a dictionary, then separately encoding small blocks).

    > Btw, I'm looking for information on how 7zip achieves its
    > multi-thread capability.

    There's optimal parsing, entropy coding and async i/o, isn't that
    kinda enough for now?
    Lzma2 adds support for independent block compression though.

  31. #30
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    459
    Thanked 257 Times in 105 Posts
    Thanks for your comment Eugene.

    So, to summarize, you seem clearly in favor of dividing the work into a chain of tasks, each stage handled in its own thread (now, that sounds a lot like a CPU pipeline btw...)

    OK, so that's probably something i'll try out on a future version of Etincelle.
    Etincelle is much too fast, so i cannot really slice the work into many stages: there's too little to do. But i get the point: for stronger algorithms, involving heavier calculations, the work can be sliced further into multiple stages, improving opportunities to use multiple cores in parallel.

    Regards


