
Thread: fpaq0p4B

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    http://shelwien.googlepages.com/fpaq0pv4b.rar
    http://shelwien.googlepages.com/fpaq0pv4B.htm
    Code:
     48.940Mbps 48.862Mbps | 100% 100% | fpaq0p (original) 
     56.590Mbps 54.367Mbps | 116% 111% | fpaq0p IntelC build 
     56.987Mbps 49.813Mbps | 116% 102% | fpaq0pv2 (original) 
     64.992Mbps 53.078Mbps | 133% 109% | fpaq0pv3 (original) 
     77.834Mbps 67.819Mbps | 159% 139% | fpaq0pv4 GCC 4.3.0 prof-gen build 
     89.309Mbps 74.677Mbps | 182% 153% | fpaq0pv4nc (blockwise eof encoding) 
     79.103Mbps 66.729Mbps | 162% 137% | fpaq0pv5 (original) 
     80.067Mbps 70.099Mbps | 164% 143% | fpaq0pv5 GCC 4.3.0 prof-gen build 
     81.834Mbps 74.656Mbps | 167% 153% | fpaq0pv4A (IntelC) 
     95.188Mbps 87.459Mbps | 194% 179% | fpaq0pv4Anc (IntelC) 
    106.537Mbps 97.165Mbps | 218% 199% | fpaq0pv4B (IntelC, not compatible with fpaq0p)
    1. Matt's weird carryless rc replaced with a port of sh_v1m,
    which supports single-step renormalization, plus other
    optimizations, like 16-bit i/o (IntelC somehow
    optimizes it using vectorization).

    2. Classes made more readable using templates.
    Well, actually not much code left from previous version.

    3. Direct win32 i/o implemented (bypassing stdio),
    but the only visible effect seems to be the smaller executable size.

    4. Preprocessors implemented to translate the classes into macros -
    I couldn't find any other way to make the rc use local variables for
    its state. This gives most of the speedup. Another possible way would
    be adding the rc vars as arguments to all the functions, but that would
    be too ugly and wouldn't guarantee that the compiler optimizes it properly.
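
    The idea in (4), as a toy sketch (my own hypothetical names, a carry-ignoring renormalization for brevity - not the actual fpaq0pv4B code): the hot loop keeps the coder state in local variables and writes it back once, which is the effect the class-to-macro translation achieves:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Member-state coder: every bit touches low/range through 'this',
// so the compiler tends to keep spilling them to memory.
struct RcMembers {
  uint32_t low = 0, range = 0xFFFFFFFFu;
  std::vector<uint8_t> out;
  void encodeBit(int bit, uint32_t p /* P(1) in 16 bits */) {
    uint32_t mid = (uint32_t)(((uint64_t)range * p) >> 16);
    if (bit) range = mid;
    else { low += mid + 1; range -= mid + 1; }   // carry ignored in this toy
    while (range < (1u << 24)) {                 // bytewise renormalization
      out.push_back((uint8_t)(low >> 24));
      low <<= 8;
      range = (range << 8) | 255;
    }
  }
};

// Same algorithm, but the whole loop runs on locals - the state can
// now live in registers for the entire encoding pass.
std::vector<uint8_t> encodeLocals(const std::vector<int>& bits, uint32_t p) {
  uint32_t low = 0, range = 0xFFFFFFFFu;
  std::vector<uint8_t> out;
  for (int bit : bits) {
    uint32_t mid = (uint32_t)(((uint64_t)range * p) >> 16);
    if (bit) range = mid;
    else { low += mid + 1; range -= mid + 1; }
    while (range < (1u << 24)) {
      out.push_back((uint8_t)(low >> 24));
      low <<= 8;
      range = (range << 8) | 255;
    }
  }
  return out;
}
```

    Both versions produce bit-identical output; only the memory traffic in the inner loop differs.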

  2. #2
    Moderator

    Join Date
    May 2008
    Location
    Tristan da Cunha
    Posts
    2,034
    Thanks
    0
    Thanked 4 Times in 4 Posts
    Thanks Shelwien!

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    Btw, does anyone know how to disable the gcc "anonymous struct" error? It's really inconvenient.

  4. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    Quote Originally Posted by Shelwien
    Direct win32 i/o implemented (bypassing stdio),
    but only visible effect seem to be the smaller executable size.
    stdio translates to the same win32 calls, so it should be enough to use read/write operations instead of getc/putc. if you need more speed - use separate i/o threads
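
    The blockwise point, as a tiny portable sketch (hypothetical helper names, plain stdio rather than win32 calls): both readers see the same bytes, only the call granularity differs:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Byte-at-a-time reading: one library call (and its overhead) per byte.
uint32_t checksumGetc(FILE* f) {
  uint32_t sum = 0;
  for (int c; (c = getc(f)) != EOF; )
    sum = sum * 31 + (uint8_t)c;
  return sum;
}

// Blockwise reading: fread in large chunks, so the per-byte work is just
// a loop over a buffer - the stdio-to-OS translation happens per block.
uint32_t checksumBlocks(FILE* f, size_t blockSize = 1 << 16) {
  std::string buf(blockSize, '\0');
  uint32_t sum = 0;
  for (size_t n; (n = fread(&buf[0], 1, blockSize, f)) > 0; )
    for (size_t i = 0; i < n; i++)
      sum = sum * 31 + (uint8_t)buf[i];
  return sum;
}
```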

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    > stdio translates to the same win32 calls. so it should be enough to use
    > read/write operations instead of getc/putc.

    Not exactly. It also uses an extra file descriptor
    (FILE* != windows handle),
    so the executable size with win32 i/o is 3-4k less than with stdio.

    Also, I actually expected to gain something from the CreateFile()
    flags (WRITE_THROUGH/NO_BUFFERING/SEQUENTIAL_SCAN etc), but
    they didn't help anyway.

    > if you need more speed - use separate i/o threads

    For a single-core cpu that would surely be slower.
    Memory-mapped i/o might help though.

    But anyway I'm only interested in algorithmic optimizations.
    Well, including compiler tweaks.
    And actually I was just trying to prove (if only to myself)
    that my rc is better than Matt's.

  6. #6
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by Shelwien
    And actually I was just trying to prove (if only to myself)
    that my rc is better than Matts
    Your sources are very hard to read - these .INC files... Matt's coder is *VERY* simple, and it's also efficient. I'm not sure that it's possible to do something fundamentally simpler and more efficient. Anyway, the catch is still in the modeling. An FPAQ0P-like model is very fast. Maybe with ideas like a multi-threaded entropy coder we're just catching fleas...

  7. #7
    Member
    Join Date
    Dec 2006
    Posts
    611
    Thanks
    0
    Thanked 1 Time in 1 Post
    Can this improved coder be integrated into LPAQ?

  8. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    Quote Originally Posted by Shelwien
    Not exactly.
    well, i mean that if you read/fread in large blocks, it translates to the same win32 calls

    Quote Originally Posted by Shelwien
    For single-core cpu that surely would be slower.
    Memory-mapped i/o might help though.
    have you ever tried it? i tested mm io on windows - it's faster for reading from the disk cache (because we avoid a memcpy), but for reading from disk it's slower! probably because read calls are optimized by windows with read-ahead, while mm files are assumed to be accessed in random order

    OTOH, using async i/o calls or separate i/o threads allows overlapping compression with DMA transfers to/from disk. if your disk still uses PIO mode, just upgrade your 386-based box to an i486 or something newer

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    win32 calls are already async enough as they are.
    I mean, WriteFile() surely doesn't wait until the block is written,
    and ReadFile() actually copies the data from the cache... well, with
    reasonable block sizes.
    And the caching system works in separate thread(s) anyway.

    But Read/Write operations at least have to copy the data
    into the user's buffer - that's where having another core for i/o
    would help. But I don't see how a separate i/o thread
    could help with a single cpu core.
    And btw, my dataset is only 64M, so there's almost no disk i/o anyway.

    MM, on the other hand, is supposed to read the data into
    the same memory where you use it, avoiding the buffering.
    But I guess with it you'd have to implement some prefetching
    and delayed write commits, so that's where threading would
    be really useful.
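
    A rough sketch of the overlapped-i/o idea under discussion (my own hypothetical names; one reader thread feeding an unbounded block queue - a real version would bound the queue to get true double buffering and cap memory):

```cpp
#include <condition_variable>
#include <cstdint>
#include <istream>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Thread-safe queue of data blocks, closed by the producer when done.
struct BlockQueue {
  std::queue<std::vector<char>> q;
  std::mutex m;
  std::condition_variable cv;
  bool done = false;
  void push(std::vector<char> b) {
    { std::lock_guard<std::mutex> l(m); q.push(std::move(b)); }
    cv.notify_one();
  }
  bool pop(std::vector<char>& b) {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return !q.empty() || done; });
    if (q.empty()) return false;   // producer finished and queue drained
    b = std::move(q.front()); q.pop();
    return true;
  }
  void finish() {
    { std::lock_guard<std::mutex> l(m); done = true; }
    cv.notify_one();
  }
};

// Reader thread fills fixed-size blocks; the consumer overlaps its
// "compression" work (here just a checksum) with the reads.
uint32_t processOverlapped(std::istream& in, size_t blockSize) {
  BlockQueue bq;
  std::thread reader([&] {
    for (;;) {
      std::vector<char> buf(blockSize);
      in.read(buf.data(), (std::streamsize)blockSize);
      buf.resize((size_t)in.gcount());
      if (buf.empty()) break;
      bq.push(std::move(buf));
    }
    bq.finish();
  });
  uint32_t sum = 0;
  std::vector<char> b;
  while (bq.pop(b))                 // consumer works while reader keeps filling
    for (char c : b) sum = sum * 31 + (uint8_t)c;
  reader.join();
  return sum;
}
```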

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    encode wrote:
    >> And actually I was just trying to prove (if only to myself)
    >> that my rc is better than Matt's
    >
    > Your sources is very hard to read - these .INC files...

    Believe me, it would be even harder to read if I merged it.
    Also there's no other way with so many replaceable parts.

    > Matt's coder is *VERY* simple,

    1. Only Matt's renormalization for encoding is simpler.
    And decoding in rcs with carry was always faster than in carryless ones.

    2. Visual simplicity doesn't really mean anything.
    Eg. do you understand why Matt's coder doesn't need any
    range cutting like in older carryless rcs?

    > also it's efficient.

    1. I _proved_ that my counterpart is more efficient,
    by testing both rc_mm and rc_sh in the same environment:

    fpaq0pv4B.htm:
    43769221 6.030s 6.579s // fpaq0pv4Anc0
    43769372 5.811s 6.564s // best result for rc_mm
    43780374 4.857s 5.342s // best result for rc_sh

    In v4B, compared to v4A, the rc_mm version became faster too,
    but not by as much.

    2. Matt's coder's drawback is that its renormalization
    always has to support shifts of up to 4 bytes, because
    even x1>x2 is possible. And for the same reason
    it also can't be switched to 16-bit granularity - it can't
    afford to lose any more precision.
    Another thing is having the same renormalization in the
    decoder, but I already mentioned that.

    > I'm not sure that it is possible
    > to do something principally simpler and more efficient.

    Dunno about simplicity, but the "more efficient" part we already have.

    > Anyway, still the catch is in modeling.
    > FPAQ0P-like model is very fast.

    I didn't change the model, only rewrote it with templates to
    reduce the branch count and improve compiler optimizations.

    > Maybe, with ideas like multi-threaded entropy coder we catching the fleas...

    ???
    fpaq0p4B can be made faster by multi-threading, if that's what you meant.
    But that's a totally different direction from my optimizations.

    Black_Fox wrote:
    > Can this improved coder be integrated into LPAQ?

    Most of fpaq0pv3->fpaq0pv4B optimizations should be applicable.

  11. #11
    Member
    Join Date
    Apr 2007
    Posts
    16
    Thanks
    0
    Thanked 0 Times in 0 Posts
    hi shelwien,

    so I tested different file access modes, and it turned out that the speed of posix vs win native under windows is about 99% equal... and that's why I would go portable and use posix... besides these two modes there is the memory-mapped mode... it's very easy to program but not faster... and the last mode is async io... this is where you can really gain an advantage! but it's much more code to write for file operations... and you have to do some threading by hand to fully gain from it.

  12. #12
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Quote Originally Posted by eugene
    and you have to do some threading by hand to fully gain from it.
    why? just process the file in blocks and write some management function. asynchronous io is somewhat similar to triple-buffered graphics output.

  13. #13
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    Hi! Very nice job!

  14. #14
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Shelwien
    I meant that MT is too heavy for a simple entropy coder. Also, it's possible that an RC doesn't need such heavy stuff.

    According to the results, you made a nice coder! I will do an additional test with a carry-based RC on my BALZ - since we're dealing with asymmetric things here.

  15. #15
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    At the same time, comparing to pure FPAQ0P is not fair, since you also improved other things, like better EOF coding - no need for an extra coded bit per byte - plus you changed the I/O routines, and so on. So the 2X speedup is not gained by the better RC alone.


  16. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,375
    Thanks
    214
    Thanked 1,023 Times in 544 Posts
    The effect from the bytewise eof removal was surprisingly small - only ~0.3s in my tests.
    The i/o replacement wasn't that great either - well, it was for the intel compiles, but v3 as a gcc prof_gen build was quite ok even with getc/putc.
    So what really helped was branch merging, model structure changes, removal of bit extraction/combining and other simplifications... and the rc replacement.
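
    To illustrate the branch-merging part (a made-up example with 16-bit probabilities and shift 5, not the actual fpaq0pv4B update): a bit-dependent counter update can be rewritten as one arithmetic expression, which compilers turn into branch-free code:

```cpp
#include <cassert>
#include <cstdint>

// Two-branch adaptive counter update, fpaq0p-style: p is P(1) in 16 bits.
uint16_t updateBranchy(uint16_t p, int bit) {
  if (bit) return (uint16_t)(p + ((65536 - p) >> 5));
  else     return (uint16_t)(p - (p >> 5));
}

// Merged variant: select the target (0 or 65536) and move 1/32 of the way
// toward it; truncating signed division makes it bit-exact with the above.
uint16_t updateMerged(uint16_t p, int bit) {
  int32_t target = (int32_t)bit << 16;   // 65536 for bit=1, 0 for bit=0
  return (uint16_t)(p + (target - (int32_t)p) / 32);
}
```

    Both functions return identical values for every (p, bit), but the merged form has no data-dependent branch for the predictor to miss.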
