Results 1 to 21 of 21

Thread: mk-bwts

  1. #1
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts

    mk-bwts

    I haven't had convenient access to my computer lately, but my BWTS transform tool received significant changes to boost performance and reduce memory requirements recently and these are on google code in the devel branch (search for mk-bwts). I have not done as thorough testing as I would like, but I've verified correct output on a fairly large set of files (and also verified the BWTS of the BWTS of these files, which tends to uncover edge cases). This should be considered pre-release quality for now.

  2. #2
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 182 Times in 108 Posts
    Is enwik9 still too big?

  3. #3
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Good question. I have not tried it. It still allocates enough space to hold the ISA, which would be 4 GB for enwik9, but now the space is only used to hold the Lyndon word ranks, so ordinarily most of that space will be unused and the OS won't allocate physical pages. So it would likely work on a 64 bit system, at least. More tinkering is possible. If you try it, please post the result.

  4. #4
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    (Duplicate removed)

    It seems like even a suffix sort on enwik9 should be impossible on a 32-bit system without breaking into smaller blocks, so I guess 64 bits is a minimum requirement for this as well. The SA alone would use the entire virtual address space.
    Last edited by nburns; 8th September 2014 at 07:34.

  5. #5
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    I got around to trying it on enwik9, and it promptly segfaulted. The reason for the segfault was that malloc refused to allocate 4 billion bytes for the suffix array (and my handling of that was not graceful). The devel branch up on google code now gives a more descriptive error message when that sort of thing occurs. I am going to have to look deeper into what conditions cause malloc to fail. The Ubuntu VM I use for development only gets 2G of physical RAM for now, so -- presumably -- giving it more would help. If Linux -- or, at least, Ubuntu -- will not allocate more than physical RAM, that might induce me to want to make some other changes to the code. There's not much I can do about the size of the suffix array, though, or needing to allocate 4x the size of the file. At least, nothing remotely approaching simple and straightforward.

  6. #6
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,487
    Thanks
    26
    Thanked 129 Times in 99 Posts
    What about explicit memory mapping of a 4 GB file?

  7. Thanks:

    nburns (31st October 2014)

  8. #7
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    What about explicit memory mapping of a 4 GB file?
    I hadn't thought of that. That would require extra setup code, but, if you think it would help, maybe I will give it a try.

  9. #8
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,487
    Thanks
    26
    Thanked 129 Times in 99 Posts
    I'm no expert and haven't tried memory mapped files for implementing allocations beyond RAM size. Create a proof of concept code and you'll know the truth :]

  10. #9
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    I'm no expert and haven't tried memory mapped files for implementing allocations beyond RAM size. Create a proof of concept code and you'll know the truth :]
    I appreciate that background information. I'm not sure when I might get around to giving it a full try. Here's a rough sketch of what that might look like (using primarily syscalls):

    1. get a temporary filename: e.g., mktemp(3)
    2. create and open the file: e.g., creat(2)
    3. expand the file to the required size: e.g., ftruncate(2)
    4. memory-map the file read/write: e.g., mmap(2)
    5. optionally close the file descriptor: e.g., close(2)
    6. if #4 resulted in a valid mapping, pass the pointer returned by mmap(2) as argument 2 to divsufsort()
    7. before the process exits, delete the temp file: e.g., unlink(2)

    That's how you might do it on a posix system, e.g. Linux or OSX. Those are obviously not native Win32 functions. My background is *nix and as far as Windows I'm pretty much tone-deaf.

    As is fairly obvious, that's a good bit more complex than just calling malloc(3). As with anything complex, there will probably be more than one way to do it with accompanying nuances and trade-offs. For instance, tmpfile(3)/mkstemp(3)/&c. appear to take care of #1, #2, #5, and #7 in one call.

    What annoys me about something like this is all the choices and trade-offs. If the process dies in some unexpected way, what's the risk of leaving behind a potentially-enormous and hard-to-find file as an artifact? Are races and security holes possible? It just opens up a whole new can of worms versus a simple malloc.

    Edit:
    I went ahead and wrote the code that I outlined in this post, because since I had already taken the time to think through it, there was little excuse not to (aside from being stubborn and quarrelsome ). It's up on Google Code in the devel-test-map-writable branch.

    It worked on enwik8 and a test input of 100M of zeroes (as expected). When I ran it on enwik9, the result was as follows:

    Code:
    [devel-test-map-writable]~/git/mk-bwts$ ./mk_bwts enwik9 
    Attempting to map temporary file: Cannot allocate memory
    Failed to allocate 4000000000 bytes for suffix array. Abort.
    [devel-test-map-writable]~/git/mk-bwts$
    As far as I can tell, that's the equivalent to malloc failing, so it appears that mmap'ing a disk file doesn't substantially change the picture. Of course, there are now more knobs and levers to tinker with, and the possibility exists that the experiment somehow failed to create the conditions it was intended to test. If anything changes, I'll post.
    Last edited by nburns; 31st October 2014 at 23:20.

  11. #10
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Well, color me impressed...

    In the call to mmap(2), I changed MAP_PRIVATE to MAP_SHARED, and upon retrying mk_bwts on enwik9, it did not fail instantly as before. As I write, it's still inside the call to divsufsort trying to suffix sort 1 billion bytes of data. I'll post the result when it finishes, if (as expected) it finishes.

    I noticed that it appears that I only gave Ubuntu 1 GB of swap space. That explains why it would be impossible to allocate enough memory to suffix sort enwik9 using entirely virtual memory (d'oh!).

    Piotr's legend expands...

    Edit:
    It's been running for over three hours now. It's apparently still in divsufsort doing the initial suffix sort. I'll keep waiting...
    Last edited by nburns; 1st November 2014 at 01:58.

  12. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,487
    Thanks
    26
    Thanked 129 Times in 99 Posts
    I think it could run for a veeery long time.

    You can try bbb from Matt. It has the option of multiphase (?) sorting, ie sorting chunks and joining the results, so that only 1.25x memory is needed.

  13. #12
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    I think it could run for a veeery long time.

    You can try bbb from Matt. It has the option of multiphase (?) sorting, ie sorting chunks and joining the results, so that only 1.25x memory is needed.
    It's been almost 8 hours and it's still going. I could make it accept the suffix array as a command-line argument and memory map that (in addition to the original text). The main reason I've shied away from that is just that it makes the interface more complicated and harder to use.

    The upside would be that it would probably seem really fast, because it wouldn't get blamed for the time it takes to suffix sort.

  14. #13
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,537
    Thanks
    758
    Thanked 676 Times in 366 Posts
    you are risking to get a big hole in the place where your swapfile resides. seriously. шэму killed two hdds just my utorrent and you make much worser thing. it's why malloc doesn't allow to alloc arrays larger than ram (at least on windows) - just fool-protection

  15. #14
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 182 Times in 108 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    you are risking to get a big hole in the place where your swapfile resides. seriously. шэму killed two hdds just my utorrent and you make much worser thing. it's why malloc doesn't allow to alloc arrays larger than ram (at least on windows) - just fool-protection
    My Windows8 PC has no problem fulfilling malloc calls that exceed the size of the RAM. And then thrashing the disk, freezing the entire OS, etc....

  16. #15
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    you are risking to get a big hole in the place where your swapfile resides. seriously. шэму killed two hdds just my utorrent and you make much worser thing. it's why malloc doesn't allow to alloc arrays larger than ram (at least on windows) - just fool-protection
    I have a pretty good drive. It's a Samsung SSD. Hopefully it held up.

    Anyway, here is the final result:

    Code:
    [devel-test-map-writable]~/git/mk-bwts$ ./mk_bwts enwik9 
    Suffix sort time          3923.92
    Find Lyndon words time      1.41
    Fix sort order time        25.38
    Write BWTS time           130.08
    Total time                4080.79
    Note that that's cpu time and not wall-clock time.

    The challenge now is to make sure it worked. Yuta's unbwts fails with "Cannot allocate memory."

  17. #16
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by nburns View Post
    I have a pretty good drive. It's a Samsung SSD. Hopefully it held up.

    Anyway, here is the final result:

    Code:
    [devel-test-map-writable]~/git/mk-bwts$ ./mk_bwts enwik9 
    Suffix sort time          3923.92
    Find Lyndon words time      1.41
    Fix sort order time        25.38
    Write BWTS time           130.08
    Total time                4080.79
    Note that that's cpu time and not wall-clock time.

    The challenge now is to make sure it worked. Yuta's unbwts fails with "Cannot allocate memory."

    I am not sure why Yuta's code failed you with UNBWTS since the unbwts is really not that complex. If you look where I did my test using BWTS on enwik9 I think I used my 64bit version of if it. But I thought I tested more than one version of it before I put it on this site. under the heading BIJECTIVE BWTS-DC-FIB

  18. #17
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by biject.bwts View Post
    I am not sure why Yuta's code failed you with UNBWTS since the unbwts is really not that complex. If you look where I did my test using BWTS on enwik9 I think I used my 64bit version of if it. But I thought I tested more than one version of it before I put it on this site. under the heading BIJECTIVE BWTS-DC-FIB
    It failed (presumably) because it tried to allocate a 4-billion-byte array and my VM only had 2 GB RAM + 1 GB swap = 3 GB.

    I tried giving the VM 5 GB and rerunning unbwts and it started running, but it was swapping like crazy. I killed it.

    Ironically, unbwts seems to need more working memory than generating the suffix array.

  19. #18
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    I put the latest changes on Google Code in the devel-test-map-writable branch in case anyone else wants to try one larger than RAM.

  20. #19
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    David,

    Do you happen to have the BWTS of enwik9 still lying around for comparison? Here are the crc32, md5, and sha-1 of the file I got:

    crc32: fc112fe8
    md5: 94af8971d20ad63d1f1a300738ad532e
    sha1: dcbee3e6c362a0f446f6b3eced14517961a66583

  21. #20
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Just wanted to update that David Scott responded to my request for hashes privately and they checked out OK.

  22. #21
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    I have a new version that reduces the amount of memory it allocates to pretty much just space for the suffix array and that's it. Running it in cywin64 with 8 GB RAM, it handled enwik9 with no problem whatsoever:

    Code:
    ~/Downloads$ mk_bwts.exe enwik9
    Suffix sort time          131.77
    Fix sort order time         8.10
    Write BWTS time            60.15
    Total time                200.02
    The write time seems a bit sluggish compared to the Linux on the VM, which probably has something to do with the fact that I'm using putc(). But overall, it ran much better in Cygwin than in the VM, probably mostly because it can use the full 8 GB of DRAM the machine has. I suspect that the VirtualBox VM makes page faults more expensive, or something along those lines, too, because even enwik8 seemed to go faster despite not hitting memory limitations on either platform. I don't normally associate Cygwin with high performance, but I guess this shows it in a pretty good light (the code barely makes any syscalls, so, really, Windows is doing all the work).

Similar Threads

  1. BIJECTIVE BWTS-DC-FIB
    By biject.bwts in forum Data Compression
    Replies: 4
    Last Post: 13th February 2012, 01:36
  2. BWTS paper
    By willvarfar in forum Data Compression
    Replies: 2
    Last Post: 19th January 2012, 13:06
  3. Question about BWTS
    By Piotr Tarsa in forum Data Compression
    Replies: 4
    Last Post: 1st May 2011, 22:01
  4. USING BWTS in BWTmix
    By biject.bwts in forum Data Compression
    Replies: 8
    Last Post: 9th March 2010, 02:00
  5. BWTS explanation?
    By TopQuark in forum Data Compression
    Replies: 5
    Last Post: 8th April 2009, 22:26

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •