Page 1 of 2 12 LastLast
Results 1 to 30 of 40

Thread: msufsort 4 update

  1. #1
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts

    msufsort 4 update

    An update on Version 4 of msufsort

    v4 is still under development but I don't ever seem to have time to work on it. So I'm posting the code as is for anyone who is interested. As is, it is a full suffix array construction algorithm but it is missing one of the algorithm's core features (induction sorting) so it won't perform well on a lot of streams at the moment.

    What is complete so far is:
    a highly tuned, 7-way multikey quicksort (tuned version from msufsort v3)
    a highly tuned, tandem repeat sort (tuned version from msufsort v3)
    cache friendly improved two stage (tuned version from msufsort v3)
    multi-threaded (new to v4)

    What is remaining on the road map is:
    opportunistic induction sorting (improved method of induction sorting from msufsort v3)
    multi-threaded second stage ITS (will be new to v4)
    first stage ITS direct to BWT (as is done with msufsort v3)


    MSufsort4 can easily out perform existing SACAs for the files that do not require induction sorting already. Performance will only improve when it is added (if I ever find the time). So as is, the code is an example of what can be achieved with a highly tuned multikey quicksort with tandem repeat. But, again, it is missing a key component so there are a lot of streams for which it will not do as well at the moment.

    Results for enwik8/9 on Intel® Core™ i7-3770 CPU @ 3.40GHz × 8 (It is unclear to me why divsufsort does worse with open mp enabled)

    Enwik8
    MSufSort 4 single threaded 6.15
    DivSufSort 2.0.2 single threaded 7.24
    MSufSort 4 multithreaded 3.41
    DivSufSort 2.0.2 multithreaded 9.74

    Enwik9
    MSufSort 4 single threaded 80.04
    DivSufSort 2.0.2 single threaded 90.16
    MSufSort 4 multithreaded 41.96
    DivSufSort 2.0.2 multithreaded 119.73


    Source is available at:
    https://github.com/michaelmaniscalco/msufsort

    - Michael

  2. Thanks (7):

    Bulat Ziganshin (21st September 2016),Cyan (21st September 2016),encode (8th October 2016),Mike (21st September 2016),samsat1024 (14th October 2016),Turtle (22nd September 2016),VadimV (1st October 2016)

  3. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    DivSufSort 2.0.2 single threaded 90.16
    DivSufSort 2.0.2 multithreaded 119.73
    mt times > st for both files

  4. #3
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    mt times > st for both files
    Yes, as I noted above. I am not sure why MT time as larger that ST. But I did confirm that MT test was in fact using all cores during the ssort phase for divsufsort while OPENMP was enabled. No idea why that is though.

  5. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    In my tests m/t helps divsufsort: https://github.com/Bulat-Ziganshin/C...aswell-i7-4770 although improvement was smaller than in msufsort4

    you can download executables here: https://github.com/Bulat-Ziganshin/C...earch/releases

  6. #5
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Latest commit adds:
    - bwt from first stage of ITS
    - unbwt
    - multi-threaded initial 16 bit bucket sort
    - multi-threaded ITS
    - multi-threaded BWT construction

    Remaining to do:
    - Induction sorting

    These changes make msufsort4 completely multi-threaded with the exception of the right to left pass of the second stage of the ITS. Adding MT support for that phase simply isn't worth it because the overhead of MT outweighs the benefits MT provides. There is a minor bug that I have not had a chance to focus on yet with multi-threaded BWT but I'll get that next. I'm sure it's nothing too big. The code is a bit ugly at this point but I wanted to get these changes committed sooner rather than later.

    MSufsort is now more than 2x faster than DivSufSort with multi-theading enabled.

    Results for enwik8/9 on Intel® Core™ i7-3770 CPU @ 3.40GHz × 8 (divsufsort MT times are now accurate)

    Enwik8
    MSufSort 4 single threaded 6.12
    DivSufSort 2.0.2 single threaded 7.37
    MSufSort 4 multithreaded 2.66
    DivSufSort 2.0.2 multithreaded 5.37

    Enwik9
    MSufSort 4 single threaded 80.45
    DivSufSort 2.0.2 single threaded 90.63
    MSufSort 4 multi-threaded 32.87
    DivSufSort 2.0.2 multi-threaded 82.05


    Source is available at:
    https://github.com/michaelmaniscalco/msufsort

    - Michael

  7. Thanks (8):

    Bulat Ziganshin (24th September 2016),hexagone (24th September 2016),JamesB (24th September 2016),jibz (24th September 2016),Matt Mahoney (30th September 2016),samsat1024 (14th October 2016),Turtle (26th September 2016),VadimV (1st October 2016)

  8. #6
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Latest commit makes msufsort 4 completely multi-threaded for all phases of suffix array construction as well as for all phases of BWT. Msufsort 4 is now over 2.6x faster than DivSufSort for MT results on enwik8 and over 2.8x faster for MT results on enwik9.

    work remaining on version 4 roadmap is:
    - multi-threaded induction sorting

    Enwik8
    MSufSort 4 single threaded 6.10
    DivSufSort 2.0.2 single threaded 7.37
    MSufSort 4 multithreaded 2.06
    DivSufSort 2.0.2 multithreaded 5.37

    Enwik9
    MSufSort 4 single threaded 80.70
    DivSufSort 2.0.2 single threaded 90.63
    MSufSort 4 multi-threaded 28.44
    DivSufSort 2.0.2 multi-threaded 82.05


    Source is available at:
    https://github.com/michaelmaniscalco/msufsort

    - Michael

  9. Thanks (5):

    Bulat Ziganshin (7th October 2016),hexagone (7th October 2016),samsat1024 (14th October 2016),Turtle (8th October 2016),willvarfar (7th October 2016)

  10. #7
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
    Very nice

  11. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    Just tried to compile it on windows. I had to remove #include <byteswap.h> in order to compile. Little research shows that you are using functions like __builtin_bswap32 that are always available. For better portability, you may prefer to use boost definitions.

    Also, it was successfully compiled without #include <pthread.h>. I'm glad to see that msufsort relies on C++ 11 thread support. You don't need to include pthread.h since you don't use any pthread_* functions. -lpthread may be required for linking on Linux, but that's a different story.

    So, i successfully compiled it with
    gcc version 6.2.0 (i686-posix-sjlj-rev0, Built by MinGW-W64 project)

    g++ -s -omsufsort64.exe -m64 -O3 -fno-stack-protector -funroll-loops -funsafe-loop-optimizations -static -std=c++11 -I. -IC:\Base\Compiler\boost_1_62_0 executable\msufsort\main.cpp library\msufsort\msufsort.cpp

    g++ -s -omsufsort32.exe -m32 -msse2 -Xlinker --large-address-aware -O3 -fno-stack-protector -funroll-loops -funsafe-loop-optimizations -static -std=c++11 -I. -IC:\Base\Compiler\boost_1_62_0 executable\msufsort\main.cpp library\msufsort\msufsort.cpp




    Single-threaded benchmark on i7-4770 with ddr3-1600:
    Code:
    C:\msufsort-master\src>msufsort64.exe b z:\e8 1
    ================================================================
    msufsort - version 4a-demo
    author: Michael A Maniscalco
    **** this is a pre-release demo ****
    **** this version is incomplete and lacks induction sorting ****
    ================================================================
    
    loaded 100000000 bytes
    computing burrows wheeler transform
    direct sort initial 16 bit sort time: 858 ms
    direct sort time: 4248 ms
    second stage right to left pass time: 1213 ms
    second stage left to right pass time: 2379 ms
    burrows wheeler transform completed - total elapsed time: 9021 ms
    test completed and results validated successfully
    
    
    C:\msufsort-master\src>msufsort64.exe b z:\e9 1
    ================================================================
    msufsort - version 4a-demo
    author: Michael A Maniscalco
    **** this is a pre-release demo ****
    **** this version is incomplete and lacks induction sorting ****
    ================================================================
    
    loaded 1000000000 bytes
    computing burrows wheeler transform
    direct sort initial 16 bit sort time: 8301 ms
    direct sort time: 64220 ms
    second stage right to left pass time: 17335 ms
    second stage left to right pass time: 29268 ms
    burrows wheeler transform completed - total elapsed time: 122317 ms
    test completed and results validated successfully
    I also tried to add -march=native to improve speed, but resulting x64 executable was 1% faster on enwik8 and 1% slower on enwik9


    For comparison, divsufsort compiled with the same gcc 6.2 runs ~5% slower:
    Code:
    C:\Compression-Research\app_bslab>bslab-gcc-x64 -b1024 -nolzp -bwt6,7 -norle -nomtf z:\e8
    [ 6] divsufsort         :  9.85 MiB/s,  9682.976 ms
    [ 7] divsufsort (OpenMP):  13.0 MiB/s,  7348.816 ms
    
    C:\Compression-Research\app_bslab>bslab-gcc-x64 -b1024 -nolzp -bwt6,7 -norle -nomtf z:\e9
    [ 6] divsufsort         :  7.49 MiB/s,  127353.205 ms
    [ 7] divsufsort (OpenMP):  9.41 MiB/s,  101388.230 ms




    Unfortunately, in m/t mode it goes into endless loop burning cpu after printing "computing burrows wheeler transform" (i tried 2 and 8 threads). well, "msufsort32.exe b msufsort32.exe 8" just crashes. I tried to inlcude pthread.h back, tried to compile with gcc 4.9.3, tried with -O2 as only optimization option, and even without any optimization options - it was the same.

    The only successful combination:
    C:\msufsort-master\src>msufsort32.exe b z:\e8 2
    direct sort initial 16 bit sort time: 400 ms
    direct sort time: 2879 ms
    second stage right to left pass time: 789 ms
    second stage left to right pass time: 2022 ms
    burrows wheeler transform completed - total elapsed time: 6336 ms
    test completed and results validated successfully
    everything else i tried with 32/64 bit executables, enwik8/9, 2/4/8 thread - goes into endless loop or, less frequently, crashes
    Attached Files Attached Files
    Last edited by Bulat Ziganshin; 7th October 2016 at 07:48.

  12. Thanks:

    michael maniscalco (7th October 2016)

  13. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    According to README, it looks like a faster divsufsort version: https://github.com/akamiru/libdivsufsort

  14. #10
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Hi Bulat,

    [update]
    Just a guess but I think this might be a stack overflow. Running the exe you provided under Wine produces a stack overflow during the first phase of sorting in the 'first_stage_its' method. A quick look online suggests that the default stack size for linux is 2MB but only 1MB for Windows. This function needs about 1MB of stack space as written. I'll change it to use the heap.
    [/update]

    Thanks for taking the time to test. I'm not really too surprised to see that there might be some issues with Windows. I have no access to a Windows box for testing. I'm sure that it's probably a very minor issue but at the moment I have no way to investigate. I'll try to get some access to windows for testing when I get a chance.

    - Michael
    Last edited by michael maniscalco; 7th October 2016 at 21:51. Reason: suggesting reason for windows issue

  15. #11
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Updated a few spots to be more cache friendly in the multi-threaded sections. Performance improvement is fairly big but would probably hurt on random data, although not too noticeably. Hopefully, the issue Bulat was encountering is resolved but I can't test on windows to verify my theory.

    New performance with cache friendly updates:
    msufsort 4 multi-threaded performance is now 2.88x faster on enwik8 and 3.29x faster on enwik9

    Enwik8
    MSufSort 4 single threaded 6.10
    DivSufSort 2.0.2 single threaded 7.37
    MSufSort 4 multi-threaded 1.86
    DivSufSort 2.0.2 multi-threaded 5.37

    Enwik9
    MSufSort 4 single threaded 80.70
    DivSufSort 2.0.2 single threaded 90.63
    MSufSort 4 multi-threaded 24.91
    DivSufSort 2.0.2 multi-threaded 82.05


    Source is available at:
    https://github.com/michaelmaniscalco/msufsort

    - Michael

  16. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    http://nishi.dreamhosters.com/u/msufsort4_0.7z
    Seems to work for book1 and enwik8, but fails for shorter files like msufsort.h
    Managed to compile it with gcc/mingw 5.10 and IntelC 17, but IC build crashes with >1 thread (or hangs forever).
    Also IC build's 1-thread result for enwik8 is 21s vs gcc's 18s.

  17. #13
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by Shelwien View Post
    "but fails for shorter files like msufsort.h"
    Damn you for using my own file against me! (^:
    I'm in the process of installing Windows and the needed tools to test under that environment now. Thanks for helping with early testing.

    - M

  18. #14
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    yes, now it works with m/t:
    C:\Testing\!Compression\msufsort-master\src>msufsort64.exe b z:\e9 8
    ================================================== ==============
    msufsort - version 4a-demo
    author: Michael A Maniscalco
    **** this is a pre-release demo ****
    **** this version is incomplete and lacks induction sorting ****
    ================================================== ==============

    loaded 1000000000 bytes
    computing burrows wheeler transform
    direct sort initial 16 bit sort time: 1602 ms
    direct sort time: 17380 ms
    second stage right to left pass time: 2340 ms
    second stage left to right pass time: 7320 ms
    burrows wheeler transform completed - total elapsed time: 33032 ms
    test completed and results validated successfully

    C:\Testing\!Compression\msufsort-master\src>msufsort64.exe b z:\e8 8
    ================================================== ==============
    msufsort - version 4a-demo
    author: Michael A Maniscalco
    **** this is a pre-release demo ****
    **** this version is incomplete and lacks induction sorting ****
    ================================================== ==============

    loaded 100000000 bytes
    computing burrows wheeler transform
    direct sort initial 16 bit sort time: 171 ms
    direct sort time: 1270 ms
    second stage right to left pass time: 217 ms
    second stage left to right pass time: 583 ms
    burrows wheeler transform completed - total elapsed time: 2725 ms
    test completed and results validated successfully

    C:\Testing\!Compression\msufsort-master\src>g++ -v
    gcc version 4.9.2 (i686-posix-sjlj-rev2, Built by MinGW-W64 project)



    I confirm bug with m/t on tiny files (both 32 & 64 works this way):
    C:\Testing\!Compression\msufsort-master\src\library>msufsort64.exe b msufsort.h 1
    ================================================== ==============
    msufsort - version 4a-demo
    author: Michael A Maniscalco
    **** this is a pre-release demo ****
    **** this version is incomplete and lacks induction sorting ****
    ================================================== ==============

    loaded 48 bytes
    computing burrows wheeler transform
    direct sort initial 16 bit sort time: 2 ms
    direct sort time: 0 ms
    second stage right to left pass time: 0 ms
    second stage left to right pass time: 0 ms
    burrows wheeler transform completed - total elapsed time: 6 ms
    test completed and results validated successfully

    C:\Testing\!Compression\msufsort-master\src\library>msufsort64.exe b msufsort.h 8
    ================================================== ==============
    msufsort - version 4a-demo
    author: Michael A Maniscalco
    **** this is a pre-release demo ****
    **** this version is incomplete and lacks induction sorting ****
    ================================================== ==============

    loaded 48 bytes
    computing burrows wheeler transform
    direct sort initial 16 bit sort time: 71 ms
    direct sort time: 12 ms
    second stage right to left pass time: 32 ms
    second stage left to right pass time: 0 ms
    burrows wheeler transform completed - total elapsed time: 142 ms
    **** BWT ERROR DETECTED
    Attached Files Attached Files

  19. #15
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    I confirm bug with m/t on tiny files (both 32 & 64 works this way):
    Latest push fixes these issues.
    - Michael

  20. #16
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts

    MSufSort 4 - new high performance multithreaded BWT inversion algorithm

    MSufSort 4 is now fitted with a new high-performance/multithreaded inverse BWT algorithm which is ~5x-6x faster than the 'standard' inverse BWT that was previously used. I've been meaning to focus on inverse BWT for several years now but just haven't had the time. But it's been bothering me that decode speeds are so poor for BWT when the basic concept is so trivial. Now that I've re-introduced M99 which has very fast decoding performance it was bothering me to see the majority of the decode time spent in the inverse BWT so this was motivation to finally implement the ideas that I had been sitting on for so many years. I have other ideas to improve inverse BWT beyond what was implemented, however, they all involve exploiting long string matches in the original input (which isn't always true) whereas these new changes are data agnostic and should improve performance regardless of the nature of the data.

    The only remaining item on the MSufSort 4 road map is to add recursive improved two stage (which is pretty important really).

    - Michael


    updated results:

    MSufsort performance before improved inverse BWT (using enwik8 )

    1 thread:

    burrows wheeler transform completed - total elapsed time: 6913 ms
    inverse burrows wheeler transform completed - total elapsed time: 6626 ms


    MSufsort performance after improved inverse BWT (using enwik8 )

    1 thread:
    burrows wheeler transform completed - total elapsed time: 6876 ms
    inverse burrows wheeler transform completed - total elapsed time: 2631 ms

    2 threads:
    burrows wheeler transform completed - total elapsed time: 3620 ms
    inverse burrows wheeler transform completed - total elapsed time: 2572 ms

    3 threads:
    burrows wheeler transform completed - total elapsed time: 2696 ms
    inverse burrows wheeler transform completed - total elapsed time: 2183 ms

    4 threads:
    burrows wheeler transform completed - total elapsed time: 2248 ms
    inverse burrows wheeler transform completed - total elapsed time: 1839 ms

    8 threads:
    burrows wheeler transform completed - total elapsed time: 2153 ms
    inverse burrows wheeler transform completed - total elapsed time: 1660 ms

    16 threads (hyper threading):
    burrows wheeler transform completed - total elapsed time: 2149 ms
    inverse burrows wheeler transform completed - total elapsed time: 1263 ms



    M99 performance on enwik8 before changes:

    BWT completed - 45.091 MB/sec
    Encode completed (single threaded) - 54.0938 MB/sec
    Encoded size = 22275155 bytes. Compression = 22.2752%

    Decode speed (single threaded): 49.1331 MB/sec
    UnBWT completed - 14.5866 MB/sec



    M99 performance on enwik8 after changes:

    BWT completed - 44.7945 MB/sec
    Encode completed (single threaded) - 54.0632 MB/sec
    Encoded size = 22275155 bytes. Compression = 22.2752%

    Decode speed (single threaded): 49.0321 MB/sec
    UnBWT completed - 57.3122 MB/sec

  21. Thanks (4):

    Bulat Ziganshin (23rd September 2017),hexagone (22nd September 2017),Mike (23rd September 2017),spwolf (23rd September 2017)

  22. #17
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Increasing the number of partitions for the algorithm has a marked improvement on performance (bumping number of partitions in the inverse bwt to 256 by default ... from 32):


    1 thread:
    burrows wheeler transform completed - total elapsed time: 6876 ms
    inverse burrows wheeler transform completed - total elapsed time: 2364 ms

    2 threads:
    burrows wheeler transform completed - total elapsed time: 3620 ms
    inverse burrows wheeler transform completed - total elapsed time: 2307 ms

    3 threads:
    burrows wheeler transform completed - total elapsed time: 2696 ms
    inverse burrows wheeler transform completed - total elapsed time: 1842 ms

    4 threads:
    burrows wheeler transform completed - total elapsed time: 2248 ms
    inverse burrows wheeler transform completed - total elapsed time: 1580 ms

    8 threads:
    burrows wheeler transform completed - total elapsed time: 2153 ms
    inverse burrows wheeler transform completed - total elapsed time: 1374 ms

    16 threads (hyper threading):
    burrows wheeler transform completed - total elapsed time: 2149 ms
    inverse burrows wheeler transform completed - total elapsed time: 1064 ms <-- just shy of 100MB/sec!
    Last edited by michael maniscalco; 23rd September 2017 at 00:04. Reason: removed unexpected bolding of all text

  23. #18
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
    Michael, these are fantastic numbers.
    What are the specs of the computer you tested on ?
    Within a couple of weeks, we get an LZ based compressor with the compression ratio of BWT / CM and a BWT compressor (18 years in the making ?) with the decompression speed of an LZ one.
    Interesting times.

  24. #19
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by hexagone View Post
    Michael, these are fantastic numbers.
    What are the specs of the computer you tested on ?
    Within a couple of weeks, we get an LZ based compressor with the compression ratio of BWT / CM and a BWT compressor (18 years in the making ?) with the decompression speed of an LZ one.
    Interesting times.
    numbers were generated on an Intel® Core™ i7-3770 CPU @ 3.40GHz × 8

  25. #20
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    I patched a number of things which cause problems for windows builds
    (in particular, mingw gcc missing std::thread implementation).
    Also replaced the frontend with something that actually writes output.
    http://nishi.dreamhosters.com/u/msufsort4a_0.rar

    Otherwise, this version seems much more stable than previous one -
    I was able to compile and test both IntelC and gcc builds without much problems

    Results look like this:
    Code:
    i7 930 @ 2800
    
    152.507s:  divsufsort.exe c1 D:\tmp\enwik9 1  954
    148.544s:  divsufsort.exe d1 1 2              954
    
    182.256s:  msufsort4a.exe c D:\tmp\enwik9 1  1
    100.418s:  msufsort4a.exe d 1 2              1
    
    147.047s:  msufsort4a_gcc71.exe c D:\tmp\enwik9 1  1
    101.291s:  msufsort4a_gcc71.exe d 1 2              1
    
     64.210s:  msufsort4a.exe c D:\tmp\enwik9 3  4
     66.846s:  msufsort4a.exe d 3 4              4
    
     60.794s:  msufsort4a_gcc71.exe c D:\tmp\enwik9 3  4
     70.091s:  msufsort4a_gcc71.exe d 3 4              4
    (The first divsufsort uses OpenMP and thus uncertain number of threads that actually do something; let's say 8).

  26. Thanks:

    spwolf (23rd September 2017)

  27. #21
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    More stats for same enwik9:
    Code:
    7820x @ 4500
    
     1:  92.594s 46.421s
     2:  52.719s 46.750s
     3:  39.031s 37.625s
     4:  32.625s 30.688s
     5:  28.422s 26.656s
     6:  26.687s 23.703s
     7:  25.281s 21.797s
     8:  25.187s 19.453s
     9:  24.062s 19.359s
    10:  24.172s 17.547s
    11:  23.515s 17.032s
    12:  24.828s 16.750s
    13:  23.859s 15.797s
    14:  23.640s 16.641s
    15:  23.937s 15.766s
    16:  36.672s 14.547s

  28. Thanks:

    spwolf (23rd September 2017)

  29. #22
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    176
    Thanks
    28
    Thanked 74 Times in 44 Posts
    This might be my consumer-grade DDR3 and 4 core limit, but I'm not getting close to 100MB/s.
    But for comparison sake I'll use my inverse bwt algo (https://github.com/loxxous/Jampack/b...a/bwt.cpp#L263) which relies mostly on OoOE but also benefits from threads:
    Code:
    msufsort4a.exe c enwik9 enwik9.bwt 16         : 50.2 seconds
    msufsort4a.exe d enwik9.bwt enwik9 16         : 43.6 seconds
    msufsort4a.exe d enwik9.bwt nul 16            : 31.7 seconds  (no IO write)
    jbwt.exe  c  enwik9 enwik9.bwt -t16 -b1000    : 153.2 seconds (using divsufsort)
    jbwt.exe  d  enwik9.bwt enwik9 -t16           : 36.0 seconds 
    jbwt.exe  d  enwik9.bwt nul -t16              : 26.1 seconds  (no IO write)
    Machine: i5-4690K @ 3.82Ghz, DDR3 @ 1333Mhz

    Msufsort4 is amazing in comparison to divsufsort when it comes to parallelism, but its new bwt decoder could possibly benefit from OoOE.
    When I last checked bsc's inverse bwt (bsc_unbwt_biPSI_parallel) was usually faster than my implementation by like 15% since it decodes 2 symbols in the same cache-line per look-up, but it's slower on a single thread since the index stream decoders were bound to threads.
    I would use the -g option in jbwt to try out gpu decoding on enwik9 but alas there isn't enough vram on a gtx 970 for that. I'll include a binary for anyone who wants to try it out.

    Michael what is your ram configuration? Inverse bwt is normally limited by ram throughput but your implementation seems to keep building up speed the more threads you throw at it; it's intriguing, might even be fitting for gpu.
    Attached Files Attached Files
    Last edited by Lucas; 23rd September 2017 at 09:09.

  30. #23
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
    I get these numbers (i7-2600 @3.40GHz, 16GB RAM, Windows 7)

    Code:
    msufsort4a.exe c enwik9 enwik9.bwt 8        : 45.8 seconds
    msufsort4a.exe c enwik9 enwik9.bwt 16      : 45.9 seconds
    msufsort4a.exe d enwik9.bwt nul 1             : 81.4 seconds  (no IO write)
    msufsort4a.exe d enwik9.bwt nul 8             : 34.5 seconds  (no IO write)
    msufsort4a.exe d enwik9.bwt nul 16           : 31.9 seconds  (no IO write)
    
    jbwt.exe  c  enwik9 enwik9.bwt -t1 -b1000     : 141.5 seconds (using divsufsort)
    jbwt.exe  c  enwik9 enwik9.bwt -t8 -b1000     : 141.6 seconds (using divsufsort)
    jbwt.exe  d  enwik9.bwt nul -t1                      : 68.9 seconds  (no IO write)
    jbwt.exe  d  enwik9.bwt nul -t8                      : 24.3 seconds  (no IO write)
    jbwt.exe  d  enwik9.bwt nul -t16                    : 23.8 seconds  (no IO write)

  31. #24
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by hexagone View Post
    I get these numbers (i7-2600 @3.40GHz, 16GB RAM, Windows 7)

    Code:
    msufsort4a.exe c enwik9 enwik9.bwt 8        : 45.8 seconds
    msufsort4a.exe c enwik9 enwik9.bwt 16      : 45.9 seconds
    msufsort4a.exe d enwik9.bwt nul 1             : 81.4 seconds  (no IO write)
    msufsort4a.exe d enwik9.bwt nul 8             : 34.5 seconds  (no IO write)
    msufsort4a.exe d enwik9.bwt nul 16           : 31.9 seconds  (no IO write)
    
    jbwt.exe  c  enwik9 enwik9.bwt -t1 -b1000     : 141.5 seconds (using divsufsort)
    jbwt.exe  c  enwik9 enwik9.bwt -t8 -b1000     : 141.6 seconds (using divsufsort)
    jbwt.exe  d  enwik9.bwt nul -t1                      : 68.9 seconds  (no IO write)
    jbwt.exe  d  enwik9.bwt nul -t8                      : 24.3 seconds  (no IO write)
    jbwt.exe  d  enwik9.bwt nul -t16                    : 23.8 seconds  (no IO write)
    I wonder if I've tuned it too closely to the hardware on which it was developed. Specifically, the initial number of simultaneous inverse bwt partitions is fairly high @ 256 per thread. On machine with a smaller cache fewer partitions may increase cache hits but thats just a guess.

    A stronger next step isn't to tune to specific hareware but to replace reading the bwt directly to get the decoded symbol but instead to use a binary search on the array index to determine the symbol in that indexed location in the bwt. that will eliminate the random access reads on the bwt array in favor of a small table that fits in 1 kb of space. it should result in an overall increase in performance if the binary search is imllemented correctly using no branches and only bit shifting.

    I will try that over the next couple of days when i have time.

  32. #25
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
    Maybe a parameter to select the number of partitions would be helpful.

    On i7-7700K @4.20GHz, 32GB RAM, Ubuntu 16.04
    16 threads

    Code:
    enwik9
    loaded 1000000000 bytes
    computing burrows wheeler transform
    burrows wheeler transform completed - total elapsed time: 23278 ms
    inverse burrows wheeler transform completed - total elapsed time: 9807 ms
    
    enwik8
    loaded 100000000 bytes
    computing burrows wheeler transform
    burrows wheeler transform completed - total elapsed time: 1905 ms
    inverse burrows wheeler transform completed - total elapsed time: 959 ms
    Less than 1 sec for enwik8 inverse BWT on this computer.

  33. #26
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    176
    Thanks
    28
    Thanked 74 Times in 44 Posts
    Quote Originally Posted by hexagone View Post
    Maybe a parameter to select the number of partitions would be helpful.

    On i7-7700K @4.20GHz, 32GB RAM, Ubuntu 16.04
    16 threads

    Code:
    enwik9
    loaded 1000000000 bytes
    computing burrows wheeler transform
    burrows wheeler transform completed - total elapsed time: 23278 ms
    inverse burrows wheeler transform completed - total elapsed time: 9807 ms
    
    enwik8
    loaded 100000000 bytes
    computing burrows wheeler transform
    burrows wheeler transform completed - total elapsed time: 1905 ms
    inverse burrows wheeler transform completed - total elapsed time: 959 ms
    Less than 1 sec for enwik8 inverse BWT on this computer.
    Would you mind testing jbwt on that same computer? I'm curious if the performance gains of msufsort4a are architecture based or ram speed based.

    Edit:
    Whoops just noticed that pc is ubuntu based and I haven't built any binaries for that, here's the code for compiling to different OS's.
    Attached Files Attached Files
    Last edited by Lucas; 24th September 2017 at 03:32.

  34. #27
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
    I tested JamPack.

    After patching the code (__min, __max undefined and a case mismatch include in utils.cpp), I can build the executable.
    However, the decompression crashes (core dump)

    Code:
     ./Jampack_x64 c /disk1/ws/enwik9 /tmp/enwik9.bwt -b1000 -t8
    Read: 1000.00 MB => 171.83 MB (17.18%) @ 3.52 MB/s
    Also, correct me if I am wrong but msufsort outputs a BWT with a single primary index. It looks like jbwt_64 writes several indexes (one per partition) to the output BWT, right?

    I will build jbwt next ...
    Last edited by hexagone; 24th September 2017 at 04:24. Reason: NVM: Found jbwt sources

  35. #28
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    "same computer"
    Code:
    jbwt_x64.exe c enwik9 1  -t# -T -b1000
    jbwt_x64.exe d 1 2       -t# -T -b1000
    
         msufsort4a        jbwt_x64
     1:  92.594s 46.421s   90.000s 51.640s
     2:  52.719s 46.750s   90.063s 33.671s
     3:  39.031s 37.625s   92.813s 29.140s
     4:  32.625s 30.688s   93.687s 20.375s
     5:  28.422s 26.656s   92.266s 20.375s
     6:  26.687s 23.703s   92.922s 18.828s
     7:  25.281s 21.797s   92.750s 15.078s
     8:  25.187s 19.453s   90.813s 14.156s
     9:  24.062s 19.359s   90.110s 14.281s
    10:  24.172s 17.547s   91.266s 14.969s
    11:  23.515s 17.032s   92.891s 13.656s
    12:  24.828s 16.750s   93.000s 13.282s
    13:  23.859s 15.797s   92.531s 12.813s
    14:  23.640s 16.641s   92.719s 13.062s
    15:  23.937s 15.766s   93.266s 12.766s
    16:  36.672s 14.547s   91.984s 11.844s

  36. Thanks:

    Lucas (24th September 2017)

  37. #29
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    176
    Thanks
    28
    Thanked 74 Times in 44 Posts
    Quote Originally Posted by hexagone View Post
    Also, correct me if I am wrong but msufsort outputs a BWT with a single primary index. It looks like jbwt_64 writes several indexes (one per partition) to the output BWT, right?
    That is correct, each index is stored during the encode stage to make decoding easier (less computation to rebuild/find the i'th partition in bwt). The indexes are stored in a footer in the block, it's easier for the structural model to encode the indexes at the end since it has already built a model based on the data.

    Thanks for the test Shelwien. Note the -T option is only applicable when there are enough blocks available, if enwik9 was transformed with 250MB blocks for example, it would load all 4 blocks and try to decode all of them. -T would restrict it to one block at a time.

  38. #30
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
    See jbwt inverse BWT numbers below for the same machine:

    Code:
    enwik9
    
    time ./jbwt_x64 d /tmp/enwik9.bwt /dev/null -t8
    Read: 1000.00 MB => 1000.00 MB (100.00%) @ 14.43 MB/s
    Completed in 69.33 seconds
    real    0m10.835s
    user    1m8.884s
    sys     0m0.440s
    
    time ./jbwt_x64 d /tmp/enwik9.bwt /dev/null -t16
    Read: 1000.00 MB => 1000.00 MB (100.00%) @ 14.43 MB/s
    Completed in 69.32 seconds
    real    0m10.816s
    user    1m8.892s
    sys     0m0.428s
    
    enwik8
    time ./jbwt_x64 d /tmp/enwik8.bwt /dev/null -t8
    Read: 100.00 MB => 100.00 MB (100.00%) @ 16.41 MB/s
    Completed in 6.10 seconds
    real    0m1.042s
    user    0m6.072s
    sys     0m0.024s
    
    time ./jbwt_x64 d /tmp/enwik8.bwt /dev/null -t16
    Read: 100.00 MB => 100.00 MB (100.00%) @ 15.61 MB/s
    Completed in 6.41 seconds
    real    0m1.047s
    user    0m6.352s
    sys     0m0.056s
    Look at the real time, the reported time must be the CPU time. A bit slower but pretty close.

Page 1 of 2 12 LastLast

Similar Threads

  1. Quo Vadis JPEG - an update
    By thorfdbg in forum Data Compression
    Replies: 8
    Last Post: 31st July 2012, 17:35
  2. MSufSort 4 is being developed :)
    By Piotr Tarsa in forum Data Compression
    Replies: 2
    Last Post: 29th December 2011, 02:05
  3. zpaq 1.02 update
    By Matt Mahoney in forum Data Compression
    Replies: 11
    Last Post: 10th July 2009, 00:55
  4. Ocamyd Update!
    By LovePimple in forum Forum Archive
    Replies: 2
    Last Post: 29th March 2008, 22:28
  5. Prob/counter update
    By encode in forum Forum Archive
    Replies: 12
    Last Post: 28th November 2007, 21:34

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •