Thread: Alba

  1. #1
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    160
    Thanks
    31
    Thanked 63 Times in 39 Posts

    Alba

    I made a new BPE compressor, named Alba.
    It compresses text better.
    Attached Files

  2. #2
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    160
    Thanks
    31
    Thanked 63 Times in 39 Posts

    update

    Added a new compression mode.
    Attached Files
    Last edited by xezz; 5th February 2014 at 08:37.

  3. #3
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    I tested on LTCB and Silesia. The original compression modes c and c32768 work but C does not produce identical output on enwik9 and some of the Silesia files. The decompressed files are also the wrong size by a few bytes. I posted benchmark results for the correct output.

    http://mattmahoney.net/dc/silesia.html
    http://mattmahoney.net/dc/text.html#5269
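
A size check alone does not confirm a round trip (a later report in this thread has the right size but wrong content), so a byte-exact comparison is the safer test. The following is a minimal sketch of such a check, not the procedure actually used for the benchmarks; the function names `file_digest` and `verify_round_trip` are hypothetical, and the decompressed file is assumed to have been produced already.

```python
import hashlib
import os


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_round_trip(original: str, restored: str) -> bool:
    """True only if both files have the same size and the same content hash."""
    if os.path.getsize(original) != os.path.getsize(restored):
        return False  # cheap early exit: sizes differ
    return file_digest(original) == file_digest(restored)
```

Comparing sizes first catches the "wrong size by a few bytes" case cheaply; the hash comparison catches the case where the size is right but the content is not.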

  4. Thanks:

    just a worm (11th February 2014)

  5. #4
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    160
    Thanks
    31
    Thanked 63 Times in 39 Posts
    Thanks, Matt! Updated now.

  6. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    It works now. I updated Silesia and LTCB.

  7. #6
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    160
    Thanks
    31
    Thanked 63 Times in 39 Posts
    Thanks, Matt!
    I made a new version.
    Attached Files

  8. #7
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    Tried it out on the Silesia corpus. Looks like C compresses better than e most of the time.

    Code:
      Silesia dicke mozil   mr   nci ooff  osdb reym samba  sao webst x-ray  xml Compressor -options
    --------- ----- ----- ---- ----- ---- ----- ---- ----- ---- ----- ----- ---- -------------------
    103224331  5496 25622 5163  6435 4447  8690 2484  8909 7239 18705  8468 1559 alba 0.2 C
    103232868  5495 25625 5163  6433 4447  8690 2486  8918 7239 18702  8468 1562 alba 0.2 e
    103273433  5497 25638 5163  6436 4450  8700 2487  8917 7239 18712  8468 1562 alba 0.2 c
    I wonder if there is a way to automatically optimize block size.

    Edit: updated LTCB. http://mattmahoney.net/dc/text.html#5265
    e compresses enwik8 a little better than C, but a little worse on enwik9. Also on enwik9, compression crashes (free(): invalid pointer. Core dumped) when done but still produces an output file that decompresses correctly with no errors. This was compiling from source with gcc 4.8.1 -O3 in 64 bit Linux.
    Last edited by Matt Mahoney; 6th February 2014 at 22:41.

  9. #8
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    160
    Thanks
    31
    Thanked 63 Times in 39 Posts
    "I wonder if there is a way to automatically optimize block size."
    I have an idea that uses a suffix tree or an enhanced suffix array,
    but the memory requirement is too large.
    A simple method is flzp-like selection.
    "compression crashes (free(): invalid pointer. Core dumped)"
    Sorry, I updated it. Next header size estimation is optimized.
    Attached Files
    Last edited by xezz; 7th February 2014 at 16:31.
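
The simplest way to pick a block size automatically is brute force: compress with each candidate size and keep the smallest result. The sketch below shows only that idea. It is not Alba's method and not the suffix-tree or flzp-like approach mentioned above; `total_cost` and `best_block_size` are hypothetical names, and `zlib.compress` is used purely as a stand-in for a block compressor.

```python
import zlib
from typing import Callable, Iterable


def total_cost(data: bytes, block_size: int,
               compress: Callable[[bytes], bytes] = zlib.compress) -> int:
    """Sum of compressed block sizes when data is cut into fixed-size blocks."""
    return sum(len(compress(data[i:i + block_size]))
               for i in range(0, len(data), block_size))


def best_block_size(data: bytes, candidates: Iterable[int],
                    compress: Callable[[bytes], bytes] = zlib.compress) -> int:
    """Try every candidate block size and return the one with the smallest output."""
    return min(candidates, key=lambda size: total_cost(data, size, compress))
```

For example, `best_block_size(data, [1024, 4096, 32768])` compresses the input three times, so the cost grows with the number of candidates. That is the trade-off against smarter per-block selection.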

  10. #9
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    160
    Thanks
    31
    Thanked 63 Times in 39 Posts

    v0.3

    alba 'E' gives the best compression.
    Attached Files
    Last edited by xezz; 11th February 2014 at 05:16.

  11. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    I tested alba 0.3 on the Silesia corpus compressing with E but it crashed during decompression on most of the files. Tested with the supplied alba.exe in 32 bit Vista.

  12. #11
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    160
    Thanks
    31
    Thanked 63 Times in 39 Posts
    Oh my god... I have already developed the next version.
    The same bug probably lives in v0.4.
    Sorry, please wait for now.

  13. #12
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    160
    Thanks
    31
    Thanked 63 Times in 39 Posts
    E is renamed to e, because the old e was not good.
    c and C are modified.
    Attached Files

  14. #13
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    160
    Thanks
    31
    Thanked 63 Times in 39 Posts
    This version implements dynamic block encoding.
    I have corrected the bad calculation in it.
    Attached Files
    Last edited by xezz; 18th February 2014 at 17:17.

  15. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    alba cd enwik8 - decompression did not verify although the size was right. I tested alba-0.5.1 in 64 bit Ubuntu, both alba.exe in Wine and compiling from source with gcc 4.8.1 -O3

  16. #15
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    160
    Thanks
    31
    Thanked 63 Times in 39 Posts
    Thanks, Matt! Bug fixed.

  17. #16
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts

  18. #17
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    160
    Thanks
    31
    Thanked 63 Times in 39 Posts
    Implemented the second BPE. A smaller block size is better for it.
    Max block size is up to 512KB for dynamic block encoding.
    But 'd' still needs improvement.
    Code:
    // normal BPE + optimal BPE
    alba c infile outfile +C
    // optimal BPE(dynamic block, max 512KB) + optimal BPE(blocksize:1024)
    alba Cd512k infile outfile +C1024
    
    alba Cd enwik8 out +C1280
    size: 100000000 to 51238011
    Attached Files
    Last edited by xezz; 14th March 2014 at 12:49.

  19. #18
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    160
    Thanks
    31
    Thanked 63 Times in 39 Posts
    v0.7: bug fixed. I also made v0.8.


    Code:
    // 2nd BPE encodes the 1st BPE data without the pair table.
    alba c+c in out
    // 2nd BPE encodes all of the 1st BPE data. It's slower than 'c+c'
    alba c,c in out
    
    
    // e[0-2] is encode level. It's for 'd' only. No need for 2nd BPE.
    alba cd:e2 in out
    
    
    // c[0-512] is max count of block spread. It's for 'd' only.
    alba cd:c15 in out
    
    
    // Example for large file
    alba Cd512k,C1280 in out
    alba Cd512k:e2 in out
    Attached Files
    Last edited by xezz; 15th March 2014 at 12:32.

  20. Thanks:

    djr91908 (9th June 2019)

  21. #19
    Member
    Join Date
    May 2019
    Location
    World
    Posts
    1
    Thanks
    1
    Thanked 0 Times in 0 Posts
    I am very interested in byte pair encoding and this program. Alba cleverly finds good dte entries.
    Options I would love to see (while maintaining best combination/brute force results of course):
    - Output the dictionary entries readable to a file
    - Limit max dictionary entries to a given number
    - Limit max entropy (depth) to a given number
    BPE is often used in old video games. Sometimes you need to find the best dtes for a huge given text, but you can only create x entries (in my case around 60) and can only use x depth (8 here). The program must be clever enough to test which dtes would give the best result. E.g. "ie" is counted most, but skipping this entry would give other entries more weight (with more depth, like "es. N"). Can't explain it better.
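
The two limits requested above can be sketched as a greedy pair-replacement loop with a cap on table size and on nesting depth. This is only an illustration under stated assumptions, not how Alba works: the names `limited_bpe` and `expand` are hypothetical, "depth" is assumed to mean nesting level (raw symbols are depth 0, a pair is one more than its deeper half), and new codes are assumed to start at 256. It always takes the most frequent allowed pair, so it does not perform the search over skipped entries that the post asks for.

```python
from collections import Counter


def limited_bpe(symbols, max_entries, max_depth):
    """Greedy pair replacement capped by table size and nesting depth.

    symbols: iterable of ints below 256 (e.g. byte values).
    Returns (encoded list, table mapping code -> (left, right)).
    """
    seq = list(symbols)
    table = {}   # code -> (left, right)
    depth = {}   # code -> nesting depth; raw symbols count as depth 0
    next_code = 256
    while len(table) < max_entries:
        counts = Counter(zip(seq, seq[1:]))
        # Keep only repeated pairs whose merged entry respects the depth limit.
        allowed = [
            (n, pair) for pair, n in counts.items()
            if n >= 2
            and 1 + max(depth.get(pair[0], 0), depth.get(pair[1], 0)) <= max_depth
        ]
        if not allowed:
            break
        _, (left, right) = max(allowed)  # most frequent allowed pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                out.append(next_code)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        table[next_code] = (left, right)
        depth[next_code] = 1 + max(depth.get(left, 0), depth.get(right, 0))
        next_code += 1
    return seq, table


def expand(seq, table):
    """Undo limited_bpe by expanding codes recursively (using an explicit stack)."""
    out, stack = [], list(reversed(seq))
    while stack:
        s = stack.pop()
        if s in table:
            left, right = table[s]
            stack.extend((right, left))  # left is popped (and emitted) first
        else:
            out.append(s)
    return out
```

Writing the table out in readable form, the first option in the list, would then be a matter of printing `expand([code], table)` for each entry.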

  22. #20
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    184
    Thanks
    49
    Thanked 13 Times in 13 Posts
    I've tested alba-0.8 (maximal compression settings), but my hexadecimal test file can't be shrunk down to 80 % losslessly. Would it be possible to tweak it to achieve that ratio? Time is not important.

  23. #21
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    160
    Thanks
    31
    Thanked 63 Times in 39 Posts
    What are your settings? And please attach the test file.

  24. #22
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    184
    Thanks
    49
    Thanked 13 Times in 13 Posts
    I've tried brute force (best compression). I've also specified the maximal block size manually, but it hurts the compression ratio.
    Attached Files
    Last edited by CompressMaster; 11th June 2019 at 19:37. Reason: typo

  25. #23
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    160
    Thanks
    31
    Thanked 63 Times in 39 Posts
    My best result:
    Code:
    alba e4096d512k:e2 jacmen_hex.txt out
    size: 1962419 to 1082458 (55.159372 %)
    This result is better than Re-Pair, so my answer is: impossible!
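
The percentage in the output above is the compressed size as a share of the original size. A small sketch of that arithmetic, applied to the two size pairs reported in this thread (`ratio_percent` is a hypothetical helper name, not part of alba):

```python
def ratio_percent(original_size: int, compressed_size: int) -> float:
    """Compressed size as a percentage of the original size."""
    return 100.0 * compressed_size / original_size


# Size pairs taken from the posts in this thread.
print(f"jacmen_hex.txt: {ratio_percent(1962419, 1082458):.6f} %")
print(f"enwik8:         {ratio_percent(100000000, 51238011):.6f} %")
```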
